Abstract

We introduce a geometric framework suitable for studying the re- 
lationships among biological sequences. In contrast to previous works, 
our formulation allows asymmetric distances (quasi-metrics), originating
from uneven weighting of strings, which may induce non-trivial
partial orders on sets of biosequences. The distances considered are 
more general than traditional generalized string edit distances. In 
particular, our framework enables non-trivial conversion between se- 
quence similarities, both local and global, and distances. Our con- 
structions apply to a wide class of scoring schemes and require much 
less restrictive gap penalties than the ones regularly used. Numerous 
examples are provided to illustrate the concepts introduced and their 
potential applications. 



1 Introduction 

Biological macromolecules such as DNA, RNA and proteins play an essen- 
tial role in all living organisms. Structurally, they are all chains of residues 
belonging to a small set of basic molecules and the functional characteris- 
tics of each macromolecule are determined by the order and composition of 
its components. It is therefore not surprising that comparison and alignment
of biological sequences is one of the most important contributions from
computational biology to modern biosciences.

Typical approaches to biosequence comparison are either distance- [67, 82]
or similarity-based [55, 69]. The distance-based approaches minimize the
cost, while those based on similarity maximize the likelihood of transfor- 
mation of one sequence into another. In both cases the comparison scores 
for sequences are obtained by extension from scores over alphabets of ba- 
sic molecules. The algorithms for computation of alignments are based on 
the dynamic programming technique [1]. Similarity-based methods became 
widely accepted because the Smith-Waterman algorithm [69] allows computation
of local alignments, involving only parts of sequences to be compared.
Local alignments are highly appropriate in biological context because ele- 
ments of structure and function are usually restricted to discrete regions of 
biosequences and hence strong similarity of fragments of two sequences need 
not extend to similarity of full sequences. Most distance methods have been 
global in nature and could not be easily adapted for local comparison. 

A downside of using local similarities for sequence comparison is that,
while their statistics can be characterized [381 B2], no constraints, apart from
algorithmic ones, are placed on the form that similarity measures can take.
Under such conditions, sets of biosequences with similarity measures cannot
be identified with mathematical structures such as metric or normed spaces,
which are a natural framework for many computational techniques such as
clustering [83] and indexing for similarity search [30]. In contrast, distance
measures on sequences naturally correspond to metrics under some mild
restrictions.

While the duality between global similarities and distances has been
recognized very early [70], it was only recently established independently by
Stojmirovic [73] and by Spiro and Macura [71] that it is possible to transform
local sequence similarity scores derived from many popular scoring functions
on building blocks of DNA and proteins into distances satisfying the triangle
inequality. In the contexts in which they were presented, the results of the
above two papers are almost equivalent; however, their perspectives are quite
different. Spiro and Macura [71] assume symmetric similarity scores and
consider the transformation which converts a similarity to a metric, while [73]
converts similarity into a quasi-metric, a metric without the symmetry
axiom. Quasi-metrics naturally correspond to partial orders and are therefore
a natural framework for local similarities.






Unlike most existing literature entries, which are concerned with align- 
ment algorithms, this paper aims to show a rigorous connection between 
similarities and distances that are metrics or quasi-metrics. Our main re- 
sults are presented in a form that allows transfer to domains that are not 
necessarily related to classical string transformations and for that reason we 
use the framework of free semigroups. We define the $\ell^p$-type edit distance,
which generalizes the regular edit distance and allows us to consider many
more scoring functions on the amino acid alphabet that fail the requirements
in [73] and [71]. Our results also allow for similarities and distances that are
asymmetric. In order to have an accurate description of distances generated 
from similarities, we introduce a novel nomenclature. 

Section 2 presents the basic definitions. Edit distances and global
similarities are discussed in Sections 3 and 4, respectively. Our main result,
Theorem 5.3, is presented in Section 5 and various kinds of local similarities
are discussed as examples. Section 6 examines the applicability of our
theory to the actual similarity measures used in contemporary computational
biology, while Section 7 discusses some possible applications of our results
and future directions. We chose to state many of the well-known results
formally and to present many examples to enhance readability. The proofs of
the established results are either omitted, or, when generalized in our new
framework, relegated to Appendix A.

2 Preliminaries 

2.1 Sequences and Free Semigroups 

Recall that the free monoid on a nonempty set $\Sigma$, denoted $\Sigma^*$, is the monoid
whose elements, called words or strings, are all finite sequences of zero or
more elements from $\Sigma$, with the binary operation of concatenation. The
unique sequence of zero letters (empty string), which we shall denote $e$, is
the identity element. The free semigroup on $\Sigma$, denoted $\Sigma^+$, is the subset of
$\Sigma^*$ containing all elements except the identity.

The length of a word $w \in \Sigma^*$, denoted $|w|$, is the number of occurrences
of members of $\Sigma$ in it. For $w = \sigma_1 \sigma_2 \cdots \sigma_n$, where $\sigma_i \in \Sigma$, $|w| = n$, and we
set $|e| = 0$.

For two words $u, v \in \Sigma^*$, $u$ is a factor or substring of $v$ if $v = xuy$ for some
$x, y \in \Sigma^*$, and $u$ is a subsequence or subword of $v$ if
$v = w_1^* u_1^* w_2^* u_2^* \cdots w_k^* u_k^* w_{k+1}^*$,
where $u = u_1^* u_2^* \cdots u_k^*$, $u_i^* \in \Sigma^*$ and $w_i^* \in \Sigma^*$. For any $x \in \Sigma^*$, we use $\mathcal{F}(x)$ to
denote the set of all factors of $x$.

We call a semigroup (monoid) $(X, \star)$ free if it is isomorphic to the free
semigroup (monoid) on some set $\Sigma$. The unique set of elements of $X$ mapping
to $\Sigma$ under the isomorphism is called the set of free generators.

Example 2.1. A DNA molecule can be represented as a word in the free
semigroup generated by the four-letter nucleotide alphabet $\Sigma = \{A, T, C, G\}$.
An RNA molecule is a word in the free semigroup generated by the alphabet
$\Sigma = \{A, U, C, G\}$. A protein can be thought of as a word in the free semigroup
generated by the standard twenty amino acid alphabet.

Example 2.2. Let $\Sigma$ be a set and denote by $M(\Sigma)$ the set of all finite
measures supported on $\Sigma$. We will call the elements of the free monoid
$M(\Sigma)^*$ profiles over $\Sigma^*$. Profiles arise as models of sets of structurally related
biological sequences where $\Sigma$ is the nucleotide or amino acid alphabet.

As a convention, for any word $u \in \Sigma^*$, the notation $u = u_1 u_2 \cdots u_n$,
where $n = |u|$, shall mean that $u_i \in \Sigma$, while the notation $u = u_1^* u_2^* \cdots u_k^*$
shall imply that $u_i^* \in \Sigma^*$. For all $1 \le k \le |u|$ we shall use $\bar{u}_k$ to denote the
word $u_1 u_2 \cdots u_k$ and set $\bar{u}_0 = e$.

Let $f : \Sigma \to \mathbb{R}$. The canonical homomorphic extension of $f$ to the free
monoid $\Sigma^*$ is a function $f : \Sigma^* \to \mathbb{R}$ such that $f(e) = 0$ and for all $x \in \Sigma^+$,

\[ f(x) = \sum_{i=1}^{|x|} f(x_i). \]

2.2 Quasi-metrics

Quasi-metrics are asymmetric distance functions that generalize metrics and 
partial orders. With their associated structures, they belong to an area of 
active research in topology and theoretical computer science [13]. We now
produce the standard definitions used in the remainder of this paper. 

A quasi-metric on a set $X$ is a mapping $d : X \times X \to \mathbb{R}_+$ such that for
all $x, y, z \in X$:

(i) $d(x, y) = d(y, x) = 0 \iff x = y$, and

(ii) $d(x, z) \le d(x, y) + d(y, z)$.

The axiom (ii) is known as the triangle inequality. If in addition $d$ is
symmetric, that is $d(x, y) = d(y, x)$ for all $x, y \in X$, then $d$ is called a metric.
A pair $(X, d)$, where $X$ is a set and $d$ a (quasi-) metric, is called a (quasi-)
metric space.

For a quasi-metric $d$, its conjugate (or dual) quasi-metric, denoted $d^*$, is
defined on $X \times X$ by $d^*(x, y) = d(y, x)$, and its associated metric, denoted
$d^s$, by $d^s(x, y) = \max\{d(x, y), d(y, x)\} = d(x, y) \vee d^*(x, y)$. Another
frequently used symmetrization of a quasi-metric is the 'sum' metric $d^u$ defined
by $d^u(x, y) = d(x, y) + d(y, x)$.

A (left) open ball of radius $r > 0$ centered at $x_0 \in X$ with respect to a
quasi-metric $d$ is the set $\{x \in X : d(x_0, x) < r\}$. The collection of all (left)
open balls centered at any $x \in X$ with any $r > 0$ is a base for a topology
on $X$ induced by $d$. This topology is in general $T_0$ but not necessarily $T_1$.
For the purpose of this paper, we will call a quasi-metric $d$ separating if the
induced topology is $T_1$, that is, if $d(x, y) = 0$ implies $x = y$ for all $x, y \in X$.
Every quasi-metric $d$ also has its associated partial order, denoted $\le_d$, defined
by

\[ x \le_d y \iff d(x, y) = 0. \]
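The definitions above can be checked mechanically on a finite set. The following is a minimal sketch, not part of the original text: the function names (`is_quasi_metric`, `conjugate`, `associated_metric`, `below`) and the example distance `d_down(x, y) = max(x - y, 0)` on integers are illustrative assumptions.

```python
from itertools import product


def is_quasi_metric(points, d, tol=1e-12):
    """Check axioms (i) and (ii) of a quasi-metric on a finite set of points."""
    for x, y in product(points, repeat=2):
        if d(x, y) < -tol:                       # values must be non-negative
            return False
        both_zero = abs(d(x, y)) <= tol and abs(d(y, x)) <= tol
        if both_zero != (x == y):                # axiom (i)
            return False
    for x, y, z in product(points, repeat=3):
        if d(x, z) > d(x, y) + d(y, z) + tol:    # axiom (ii), triangle inequality
            return False
    return True


def conjugate(d):
    """Conjugate quasi-metric d*(x, y) = d(y, x)."""
    return lambda x, y: d(y, x)


def associated_metric(d):
    """Associated metric max{d(x, y), d(y, x)}."""
    return lambda x, y: max(d(x, y), d(y, x))


def below(d, x, y, tol=1e-12):
    """Associated partial order: x <=_d y iff d(x, y) = 0."""
    return abs(d(x, y)) <= tol


def d_down(x, y):
    """Hypothetical asymmetric example on integers: cost of going 'down' from x to y."""
    return max(x - y, 0)
```

With this choice the associated partial order coincides with the usual order on integers, since `d_down(x, y)` vanishes exactly when `x <= y`.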

A quasi-metric $d$ is called a weightable quasi-metric [44] if there exists
a function $w : X \to \mathbb{R}_+$, called the weight function or simply the weight,
satisfying for every $x, y \in X$

\[ d(x, y) + w(x) = d(y, x) + w(y). \]

In this case we call $d$ weightable by $w$. A quasi-metric $d$ is co-weightable if
its conjugate quasi-metric $d^*$ is weightable. The weight function $w$ by which
$d^*$ is weightable is called the co-weight of $d$ and $d$ is co-weightable by $w$.

A concept strongly related to weighted quasi-metrics is that of a partial
metric [50]. A partial metric on a set $X$ is a mapping $p : X \times X \to \mathbb{R}_+$ such
that for all $x, y, z \in X$:

(i) $p(x, y) \ge p(x, x)$;

(ii) $x = y \iff p(x, x) = p(y, y) = p(x, y)$;

(iii) $p(x, y) = p(y, x)$;

(iv) $p(x, z) \le p(x, y) + p(y, z) - p(y, y)$.

It has been shown [50] that there is a bijection between the partial metrics and
generalized weighted quasi-metrics: the transformation $d(x, y) = p(x, y) - p(x, x)$
produces a generalized weighted quasi-metric with weight function
$x \mapsto p(x, x)$ out of a partial metric, while the transformation
$p(x, y) = d(x, y) + w(x)$ produces a partial metric out of a generalized weighted
quasi-metric $d$ with weight $w$.
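As a small illustration of this correspondence, the sketch below uses the partial metric $p(x, y) = \max(x, y)$ on non-negative integers, a standard textbook example chosen here as an assumption rather than taken from the paper. All function names are hypothetical.

```python
from itertools import product


def p_max(x, y):
    """Partial metric on non-negative numbers: p(x, y) = max(x, y)."""
    return max(x, y)


def is_partial_metric(points, p):
    """Check axioms (i)-(iv) of a partial metric on a finite set."""
    for x, y in product(points, repeat=2):
        if p(x, y) < p(x, x):                                # (i)
            return False
        if (x == y) != (p(x, x) == p(y, y) == p(x, y)):      # (ii)
            return False
        if p(x, y) != p(y, x):                               # (iii)
            return False
    return all(p(x, z) <= p(x, y) + p(y, z) - p(y, y)        # (iv)
               for x, y, z in product(points, repeat=3))


def quasi_metric_from_partial(p):
    """d(x, y) = p(x, y) - p(x, x), with weight w(x) = p(x, x)."""
    d = lambda x, y: p(x, y) - p(x, x)
    w = lambda x: p(x, x)
    return d, w


def partial_from_quasi_metric(d, w):
    """Inverse transformation: p(x, y) = d(x, y) + w(x)."""
    return lambda x, y: d(x, y) + w(x)
```

For `p_max` the induced quasi-metric is `d(x, y) = max(y - x, 0)` with weight `w(x) = x`, and the weightability identity `d(x, y) + w(x) = d(y, x) + w(y)` reduces to `max(x, y) = max(y, x)`.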
3 Edit distance




Waterman, Smith and Beyer, in their 1976 paper [82], introduced a general
form of the edit distance on sets of words, henceforth referred to as the
WSB distance. It was constructed by defining a set of allowed weighted 
transformations between two strings and then minimizing the sum of weights 
of allowed operations transforming (in the sense of ordered composition) one 
word into another. They also proposed an algorithm to compute the WSB 
distance based on dynamic programming. 

In this section, we present a recursive definition of edit distance on a 
free semigroup that generalizes that of Waterman, Smith and Beyer and de- 
scribe some of its most important properties. The edit distance provides the 
conceptual and algorithmic foundation to both global and local similarities 
on free semigroups. Before producing the main definition, we formalize the 
concept of a gap penalty, which we will discuss in detail later in the text. 

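Definition 3.1. Let $\Sigma$ be a set. A positive function $\gamma : \Sigma^+ \to \mathbb{R}$ is called a
gap penalty over $\Sigma^+$ if for all $u, v \in \Sigma^+$,

\[ \gamma(u) + \gamma(v) \ge \gamma(uv). \tag{1} \]

We denote by $\Gamma(\Sigma)$ the set of all gap penalties over $\Sigma^+$.

Definition 3.2. Let $\Sigma$ be a set, $d : \Sigma \times \Sigma \to \mathbb{R}$, and $\alpha$ and $\beta$ be functions
$\Sigma^+ \to \mathbb{R}$ such that $\alpha^p, \beta^p \in \Gamma(\Sigma)$. Let $x, y \in \Sigma^*$ and let $m = |x|$ and $n = |y|$.
Let $1 \le p < \infty$ and define the distance $D : \Sigma^* \times \Sigma^* \to \mathbb{R}$ using the following
recursion:

(a) $D(\bar{x}_0, \bar{y}_0) = D(e, e) = 0$,

(b) $D(e, \bar{y}_j) = \alpha(\bar{y}_j)$ for all $1 \le j \le n$,

(c) $D(\bar{x}_i, e) = \beta(\bar{x}_i)$ for all $1 \le i \le m$, and

(d) for all $1 \le i \le m$ and $1 \le j \le n$,

\[
D^p(\bar{x}_i, \bar{y}_j) = \min \left\{
\begin{array}{l}
D^p(\bar{x}_{i-1}, \bar{y}_{j-1}) + d^p(x_i, y_j), \\
\min_{1 \le k \le j} \left\{ D^p(\bar{x}_i, \bar{y}_{j-k}) + \alpha^p(y_{j-k+1} \cdots y_j) \right\}, \\
\min_{1 \le k \le i} \left\{ D^p(\bar{x}_{i-k}, \bar{y}_j) + \beta^p(x_{i-k+1} \cdots x_i) \right\}
\end{array}
\right\}.
\]

The $\ell^p$ edit distance between the sequences $x$ and $y$ (extending $d$, $\alpha$ and $\beta$)
is then given by $D(x, y) = D(\bar{x}_m, \bar{y}_n)$.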

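Remark 3.3. We have assumed that $\alpha^p, \beta^p \in \Gamma(\Sigma)$ instead of just being
positive functions in order to have $D(e, x) = \alpha(x)$ and $D(x, e) = \beta(x)$ for all
$x \in \Sigma^+$. For a general positive function $\alpha : \Sigma^+ \to \mathbb{R}$, the function $\gamma$, given
recursively for all $x \in \Sigma^+$ by $\gamma(\bar{x}_1) = \alpha^p(x_1)$ and

\[ \gamma(\bar{x}_i) = \min_{1 \le k \le i} \left\{ \gamma(\bar{x}_{i-k}) + \alpha^p(x_{i-k+1} \cdots x_i) \right\}, \tag{2} \]

where $\gamma(\bar{x}_0) = 0$, will belong to $\Gamma(\Sigma)$ and therefore $\gamma^{1/p}$ can be used in the
definition of $D$ instead of $\alpha$.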
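The recursion (2) amounts to taking the cheapest way of cutting a word into consecutive fragments. A minimal sketch under stated assumptions: the helper name `gap_penalty_from` is hypothetical, its argument `alpha_p` stands for the function $\alpha^p$ itself, and the convention $\gamma(e) = 0$ is used to start the recursion.

```python
def gap_penalty_from(alpha_p):
    """Build gamma from a positive function alpha_p via recursion (2)."""
    def gamma(word):
        n = len(word)
        best = [0.0] * (n + 1)   # best[i] = gamma of the prefix of length i; best[0] = 0
        for i in range(1, n + 1):
            # Split off the last k letters as one fragment, for every 1 <= k <= i.
            best[i] = min(best[i - k] + alpha_p(word[i - k:i])
                          for k in range(1, i + 1))
        return best[n]
    return gamma
```

For instance, the quadratic cost `lambda u: len(u) ** 2` violates inequality (1), but the function returned by `gap_penalty_from` charges one unit per letter and is subadditive.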

Remark 3.4. Also note that the distance $D$ as defined does not extend
$d$ from $\Sigma$ in the strict sense, that is, it is not necessarily true that for all
$a, b \in \Sigma$, $D(a, b) = d(a, b)$. However, this statement does become correct if
we additionally assume $d^p(a, b) \le \beta^p(a) + \alpha^p(b)$.

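Remark 3.5. The edit distance between $x$ and $y$ can be computed using the
dynamic programming algorithm of Waterman, Smith and Beyer [82]. Let
$\mathbf{D}$ be an $(m + 1) \times (n + 1)$ matrix with rows and columns indexed from 0
such that $\mathbf{D}_{0,0} = 0$ and for all $i = 1, 2, \ldots, m$ and $j = 1, 2, \ldots, n$,
$\mathbf{D}_{i,0} = \beta^p(\bar{x}_i)$, $\mathbf{D}_{0,j} = \alpha^p(\bar{y}_j)$ and

\[
\mathbf{D}_{i,j} = \min \left\{
\begin{array}{l}
\mathbf{D}_{i-1,j-1} + d^p(x_i, y_j), \\
\min_{1 \le k \le j} \left\{ \mathbf{D}_{i,j-k} + \alpha^p(y_{j-k+1} \cdots y_j) \right\}, \\
\min_{1 \le k \le i} \left\{ \mathbf{D}_{i-k,j} + \beta^p(x_{i-k+1} \cdots x_i) \right\}
\end{array}
\right\}. \tag{3}
\]

Then, we have $D(x, y) = (\mathbf{D}_{m,n})^{1/p}$. The original WSB distance is obtained
when $p = 1$.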
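The matrix recursion (3) translates directly into code. The following is a minimal, unoptimized sketch; the function name `edit_distance` and its argument order are assumptions made for illustration. Here `alpha` prices insertions (so that $D(e, y) = \alpha(y)$), `beta` prices deletions, and both take a whole fragment as input.

```python
def edit_distance(x, y, d, alpha, beta, p=1.0):
    """l^p edit distance via the dynamic programming recursion (3).

    d(a, b)  -- substitution cost on letters
    alpha(u) -- cost of inserting the fragment u
    beta(u)  -- cost of deleting the fragment u
    """
    m, n = len(x), len(y)
    D = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        D[i][0] = beta(x[:i]) ** p
    for j in range(1, n + 1):
        D[0][j] = alpha(y[:j]) ** p
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            best = D[i - 1][j - 1] + d(x[i - 1], y[j - 1]) ** p
            for k in range(1, j + 1):      # insert the last k letters of y
                best = min(best, D[i][j - k] + alpha(y[j - k:j]) ** p)
            for k in range(1, i + 1):      # delete the last k letters of x
                best = min(best, D[i - k][j] + beta(x[i - k:i]) ** p)
            D[i][j] = best
    return D[m][n] ** (1.0 / p)
```

With the discrete metric on letters, `alpha = beta = len` and `p = 1`, this computes the unit-cost string edit distance, for example 3 for `"kitten"` and `"sitting"`. The triple loop reflects the general $O(m^2 n + m n^2)$ cost of arbitrary gap penalties and makes no attempt at the faster affine-gap variant.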

3.1 Alignments 

From the recursive definition, it follows that the edit distance $D(x, y)$ can
be decomposed as the $\ell^p$ sum of the distances of non-overlapping factors of
$x$ and $y$. This decomposition provides an optimal alignment between $x$ and
$y$.
Definition 3.6 ([SS]). Let $x, y \in \Sigma^*$. An alignment between $x$ and $y$ is a
finite sequence of pairs $((x_k^*, y_k^*))_{k=1}^{K}$ where $x = x_1^* x_2^* \cdots x_K^*$, $y = y_1^* y_2^* \cdots y_K^*$
and for each $1 \le k \le K$ either

(a) $x_k^* = x_i$ and $y_k^* = y_j$ for some $i, j$, or

(b) $x_k^* \in \mathcal{F}(x)$, $x_k^* \ne e$ and $y_k^* = e$, or

(c) $x_k^* = e$, $y_k^* \in \mathcal{F}(y)$ and $y_k^* \ne e$.

We will use $\mathcal{A}(x, y)$ to denote the set of all alignments of $x$ and $y$.

Each pair $(x_k^*, y_k^*)$ corresponds to an edit operation that transforms $x_k^*$ into
$y_k^*$. Pairs of the form $(a, b)$, $(x, e)$ and $(e, y)$ where $a, b \in \Sigma$ and $x, y \in \Sigma^+$
represent a substitution of the letter $a$ for the letter $b$, deletion of the word
$x$ and insertion of the word $y$, respectively. Insertions and deletions are
collectively called indels.

Every transformation $(x_k^*, y_k^*)$ can be given a weight or a cost equal to
$D(x_k^*, y_k^*)$, with the weight of an alignment $((x_k^*, y_k^*))_{k=1}^{K}$ being equal to the
$\ell^p$ sum of the weights of the individual transformations. The distance $d$ on
$\Sigma$ provides substitution costs, while the values of $\alpha$ and $\beta$ give the costs of
indels. Thus, the edit distance between $x$ and $y$ can be described as the
minimum weighted cost (in the $\ell^p$ sense) of transforming the sequence $x$
into $y$ using substitutions and indels as edit operations. This provides an
alternative characterization of edit distance, which was long known for the
$\ell^1$ case [SB] and which we state here in general form without proof as Lemma
3.7 below.

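Lemma 3.7. Let $\Sigma$ be a set, $d : \Sigma \times \Sigma \to \mathbb{R}_+$, and $\alpha, \beta : \Sigma^+ \to \mathbb{R}_+$. Suppose
$D$ is an $\ell^p$ edit distance on $\Sigma^*$ with respect to $d$, $\alpha$ and $\beta$. Then, for all
$x, y \in \Sigma^*$,

\[
D(x, y) = \min \left\{ \left( \sum_{k=1}^{K} D^p(x_k^*, y_k^*) \right)^{1/p} : ((x_k^*, y_k^*))_{k=1}^{K} \in \mathcal{A}(x, y) \right\}.
\]

□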



3.2 Edit distances as quasi-metrics 

We now proceed to state the conditions for an $\ell^p$ edit distance to be a
quasi-metric. For simplicity we restrict ourselves to edit distances with gap
penalties that are increasing and depend solely on fragment composition and
length, while more general gap penalties are considered in Appendix A.1.

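Definition 3.8. Let $\Sigma$ be a set. We call a function $\gamma : \Sigma^* \to \mathbb{R}$ increasing if
for all $u, v, x \in \Sigma^*$,

\[ \gamma(uxv) \ge \gamma(x). \]

Definition 3.9. Let $\Sigma$ be a set. A function $\gamma \in \Gamma(\Sigma)$ is called a
composition-length gap penalty on $\Sigma^+$ if it is increasing and has a form

\[ \gamma(z) = \sum_{i=1}^{|z|} \varphi(z_i) + \psi(|z|) \tag{4} \]

for all $z \in \Sigma^+$, where $\varphi$ is a map $\Sigma \to \mathbb{R}$ and $\psi$ is a function $\mathbb{N} \to \mathbb{R}$. We
denote by $\Gamma_{CL}(\Sigma)$ the set of all composition-length gap penalties on $\Sigma^+$.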

Composition-length gap penalties have a component solely dependent on
the length of the inserted or deleted word and a composition-dependent
component. Current applications of edit distances in computational biology (see
for example [25]) mainly use gap penalties that are the same for insertions
and deletions and depend solely on the fragment length, thus satisfying our
definition of composition-length gap penalties with $\varphi = 0$. We chose the
above definition in order to include all such cases and to provide simple but
sufficiently general gap penalties for consideration of global and local
similarities. The requirement for composition-length gap penalties to be increasing
is included because it is a necessary condition for applications of our main
Theorem 5.3.

The most widely used length-dependent gap penalty functions are linear,
of the form $\psi(k) = \mu k$, and affine, of the form $\psi(k) = \mu + \nu k$, where $\mu, \nu$ are
constants. The main advantage of affine gap penalties is that the dynamic
programming algorithm for computation of distances in this case can be
modified to run in $O(nm)$ average and worst case time, where $m = |x|$
and $n = |y|$ [21], as opposed to $O(m^2 n + m n^2)$ for the most general WSB
algorithm [82]. Gap penalties of the form $\psi(k) = \mu + \nu \log(k)$ have also been
considered [SI]. Note that the algorithmic complexity of the WSB algorithm
for distances using composition-length gap penalties depends mainly on the
form of $\psi$ since the composition-dependent component is linear.
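For a penalty that depends on length alone, inequality (1) becomes $\psi(j) + \psi(k) \ge \psi(j + k)$, which is easy to probe numerically. The sketch below is illustrative only: the helper names are hypothetical and the checks cover only lengths up to a finite bound `max_len`, so they can refute but not prove the property.

```python
import math


def linear_gap(mu):
    """psi(k) = mu * k"""
    return lambda k: mu * k


def affine_gap(mu, nu):
    """psi(k) = mu + nu * k"""
    return lambda k: mu + nu * k


def log_gap(mu, nu):
    """psi(k) = mu + nu * log(k)"""
    return lambda k: mu + nu * math.log(k)


def is_subadditive(psi, max_len=50, tol=1e-12):
    """Check psi(j) + psi(k) >= psi(j + k) for all 1 <= j, k <= max_len."""
    return all(psi(j) + psi(k) + tol >= psi(j + k)
               for j in range(1, max_len + 1)
               for k in range(1, max_len + 1))


def is_increasing(psi, max_len=50):
    """Check that psi is non-decreasing on 1..max_len."""
    return all(psi(k + 1) >= psi(k) for k in range(1, max_len))
```

Under these checks, linear and affine penalties with non-negative constants pass, while a logarithmic penalty passes only when the opening cost is large enough relative to the slope (the case $j = k = 1$ requires $\mu \ge \nu \log 2$).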

Theorem 3.10. Let $\Sigma$ be a set and let $1 \le p < \infty$. Suppose $d$ is a separating
quasi-metric on $\Sigma$ and $\gamma, \delta \in \Gamma_{CL}(\Sigma)$ such that for all $a, b \in \Sigma$,




(5) 




(6) 




(7) 
and

\[ \delta(a) - \delta(b) \le d^p(a, b). \tag{8} \]

Let $\alpha = \gamma^{1/p}$ and $\beta = \delta^{1/p}$. Then, the $\ell^p$ edit distance $D$, extending $d$, $\alpha$ and
$\beta$, is a separating quasi-metric on $\Sigma^*$. □



Theorem 3.10 is a generalization of similar theorems for $p = 1$ proven
by Waterman et al. [82] for constant substitution costs and gap penalties
depending on fragment length, and by Spiro and Macura [71] in a more
general setting. We state and prove a version with fewer restrictions on gap
penalties as Theorem A.1 in Appendix A.1.

Remark 3.11. According to [64], a quasi-metric $d$ defined on a semigroup
$(X, \star)$ is called invariant with respect to $\star$ if for all $x, y, z \in X$,

\[ d(x \star z, y \star z) \le d(x, y) \quad \text{and} \quad d(z \star x, z \star y) \le d(x, y). \tag{9} \]

It is apparent from the definition that the edit distance $D$ on the free
semigroup $\Sigma^*$, which satisfies Theorem 3.10, is invariant with respect to string
concatenation.

Since our edit distances depend on several parameters, we introduce a
nomenclature to make this explicit.

Definition 3.12. Let $\Sigma$ be a set and let $1 \le p < \infty$. Suppose $D$ is an $\ell^p$ edit
distance extending a quasi-metric $d$ on $\Sigma$ and gap penalties $\alpha, \beta$ such that
$\alpha^p, \beta^p \in \Gamma_{CL}(\Sigma)$. We will write $D = EQ^p(d, \alpha, \beta)$ if $D$ is a quasi-metric and
$D = EM^p(d, \alpha)$ if $D$ is a metric (it is necessary that $\alpha = \beta$ if $D$ is a metric).

Most (if not all) instances of edit distances in computer science,
computational biology and pure mathematics involve the $\ell^1$ edit distances. Below,
we outline some of the well-known examples.

Example 3.13. The Levenshtein metric [46] (the original 'string edit
distance') is the smallest number of permitted edit operations (substitutions
and indels) required to transform one string into another. In our
nomenclature, for a set of letters $\Sigma$, the Levenshtein distance is realized as $EM^1(d, \alpha)$
where $\alpha(u) = |u|$ for all $u \in \Sigma^+$ and $d$ is the discrete metric, that is, for all
$a, b \in \Sigma$,

\[
d(a, b) = \begin{cases} 0 & \text{if } a = b, \\ 1 & \text{if } a \ne b. \end{cases} \tag{10}
\]
Example 3.14. The Sellers distance, introduced by Sellers in 1974 [67], is a
metric obtained by extension of a metric $d$ on the set $\Sigma_e = \Sigma \cup \{e\}$, the set
of generators plus the identity element, to the free monoid $\Sigma^*$. It is realized
as $EM^1(d, \alpha)$ where $\alpha(u) = \sum_{i=1}^{|u|} d(u_i, e)$ for all $u \in \Sigma^+$.

This construction has long been known in the theory of topological groups
as the Graev metric [25] on the free group $F(\Sigma)$. Recall that $F(\Sigma)$
consists of all sequences of letters from the generating set $\Sigma$ and their inverses;
in other words, $F(\Sigma) = Y^*$, where $Y = \Sigma \cup \Sigma^{-1}$ and $\Sigma^{-1}$ is the set consisting
of inverses of elements of $\Sigma$. Let $\rho$ be a metric on the set $Y_e = Y \cup \{e\}$.
The Graev metric $\bar{\rho}$ is then a maximal invariant metric on $F(\Sigma)$ such that $\bar{\rho}$
restricted to the set $Y_e$ is equivalent to $\rho$. Note that the notion of invariance
in this context is slightly different from the definition of an invariant
quasi-metric on a semigroup from Remark 3.11 above: a metric $\rho$ on a group $(X, \star)$
is called invariant with respect to $\star$ if for all $x, y, z \in X$,

\[ \rho(x \star z, y \star z) = \rho(z \star x, z \star y) = \rho(x, y). \tag{11} \]

The maximality of the Sellers-Graev metric can also be observed in the
context of the free monoid $\Sigma^*$ using the following argument. Let $D =
EM^p(d, \alpha)$ where $d$ is a metric on $\Sigma$ and $\alpha$ is a gap penalty. Define a metric $d_e$
on $\Sigma_e$ by

\[
d_e(a, b) = \begin{cases}
D(a, b) & \text{if } a, b \in \Sigma, \\
\alpha(a) & \text{if } b = e, \\
\alpha(b) & \text{if } a = e.
\end{cases} \tag{12}
\]

It is clear that $D$ extends $d_e$ from $\Sigma_e$ to $\Sigma^*$. However, for every $x \in \Sigma^*$,

\[ D(x, e) \le \left( \sum_{i=1}^{|x|} \alpha^p(x_i) \right)^{1/p} \le \sum_{i=1}^{|x|} \alpha(x_i), \]

and hence every edit distance extending $d_e$ to $\Sigma^*$ will be smaller than the
Sellers-Graev distance.

Example 3.15. Let $\Sigma$ be a set and for $u, v \in \Sigma^*$ denote by $LCS(u, v)$ the
longest common subsequence of $u$ and $v$. Define

\[ \rho(u, v) = |u| + |v| - 2\,|LCS(u, v)|. \]

It can be easily shown that $\rho$ is a metric on $\Sigma^*$ and that $\rho$ can be realized
as $EM^1(d, \alpha)$ where $\alpha(u) = |u|$ for all $u \in \Sigma^+$ and $d(a, b) = 2$ for all $a, b \in \Sigma$
such that $a \ne b$ (cf. [25], p. 246). Since $d(a, b) \ge \alpha(a) + \alpha(b)$, the optimal
alignment can be expressed solely in terms of insertions and deletions. The
longest common subsequence metric provides a special case of the
Sellers-Graev metric.
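The longest common subsequence distance can be computed without the general edit distance machinery. A minimal sketch, with hypothetical function names `lcs_length` and `lcs_distance`, using the standard row-by-row dynamic program for the LCS length:

```python
def lcs_length(u, v):
    """Length of a longest common subsequence of u and v."""
    prev = [0] * (len(v) + 1)
    for a in u:
        cur = [0]
        for j, b in enumerate(v, start=1):
            if a == b:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[len(v)]


def lcs_distance(u, v):
    """rho(u, v) = |u| + |v| - 2 |LCS(u, v)|."""
    return len(u) + len(v) - 2 * lcs_length(u, v)
```

The value of `lcs_distance(u, v)` counts the letters of both words that are left out of a longest common subsequence, which matches the reading of the optimal alignment as a sequence of insertions and deletions only.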

3.3 Alignment decomposition 

Recall that Lemma 3.7 indicates that the total edit distance $D$ between
two words $x$ and $y$ can be optimally decomposed as an $\ell^p$ sum of the distances
between constituent factors of $x$ and $y$. Lemma 3.17 below shows that, if the
gap penalties are increasing, an arbitrary choice of a factor $y'$ of $y$ decomposes
the edit distance between $x$ and $y$ into a sum of the edit distances between
fragments of $x$ and $y$. In this case, all of $x$ is used up while some parts of
$y$ could be 'lost' (Figure 1). A similar splitting can also be achieved with a
choice of a fragment of $x$. We call this property arbitrary decomposability.

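Definition 3.16. Let $\Sigma$ be a set, let $\rho : \Sigma^* \times \Sigma^* \to \mathbb{R}$ be a distance function on the
free monoid $\Sigma^*$ and let $1 \le p < \infty$. We say that $\rho$ is arbitrarily decomposable
of order $p$ if for all $x, y \in \Sigma^*$,

(i) For every $y' \in \mathcal{F}(y)$ there exist $x', x_1^*, x_2^* \in \mathcal{F}(x)$ such that $x = x_1^* x' x_2^*$
and $y_1^*, y_2^*, u, v \in \mathcal{F}(y)$ such that $y = y_1^* u y' v y_2^*$ and

\[ \rho(x, y) \ge \left( \rho^p(x_1^*, y_1^*) + \rho^p(x', y') + \rho^p(x_2^*, y_2^*) \right)^{1/p}. \tag{A1} \]

(ii) For every $x' \in \mathcal{F}(x)$ there exist $y', y_1^*, y_2^* \in \mathcal{F}(y)$ such that $y = y_1^* y' y_2^*$
and $x_1^*, x_2^*, u, v \in \mathcal{F}(x)$ such that $x = x_1^* u x' v x_2^*$ and

\[ \rho(x, y) \ge \left( \rho^p(x_1^*, y_1^*) + \rho^p(x', y') + \rho^p(x_2^*, y_2^*) \right)^{1/p}. \tag{A2} \]

Note that if the distance function $\rho$ is symmetric, the two properties above
collapse into a single one.

Lemma 3.17. Let $\Sigma$ be a set and let $d : \Sigma \times \Sigma \to \mathbb{R}$. Suppose that $\alpha$ and $\beta$
are increasing functions $\Sigma^+ \to \mathbb{R}$ such that $\alpha^p, \beta^p \in \Gamma(\Sigma)$ and $D$ is an $\ell^p$ edit
distance on $\Sigma^*$ extending $d$, $\alpha$ and $\beta$. Then, $D$ is arbitrarily decomposable of
order $p$.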
Figure 1: Arbitrary decomposability (part (A1)) of an alignment. A choice of
$y'$ induces a decomposition of both $x$ and $y$ such that $x = \tilde{x} x' \hat{x}$, $y = \tilde{y} u y' v \hat{y}$
and $\rho(x, y) \ge (\rho^p(\tilde{x}, \tilde{y}) + \rho^p(x', y') + \rho^p(\hat{x}, \hat{y}))^{1/p}$. Dashed lines indicate the
boundaries of edit operations. The fragments $u$ and $v$ of $y$ are 'lost': they do
not contribute to decomposition.



Proof. We will prove only the first part of the definition of arbitrary
decomposability because the second follows by the same argument. Let $x, y \in \Sigma^*$
and let $y' \in \mathcal{F}(y)$. By Lemma 3.7 the distance $D(x, y)$ can be written as

\[ D(x, y) = \left( \sum_{k=1}^{K} D^p(x_k^*, y_k^*) \right)^{1/p}, \]

where $x = x_1^* x_2^* \cdots x_K^*$ and $y = y_1^* y_2^* \cdots y_K^*$. Let $1 \le m \le n \le K$ be such that
$y_m^* \ne e$, $y_n^* \ne e$, $y' \in \mathcal{F}(y_m^* \cdots y_n^*)$ and $y_{m+1}^* \cdots y_{n-1}^* \in \mathcal{F}(y')$ (i.e. $y_m^* \cdots y_n^*$
is the smallest factor of $y$ having $y'$ as a factor; see Figure 1). Then, the
fragments $y_m^*$ and $y_n^*$ contain parts of $y'$. (Note that $y'$ always coincides with
$y_m^* \cdots y_n^*$ if the gap penalties depend only on composition.)

Consider the fragment $y_m^*$. According to Lemma 3.7, $y_m^*$ can be either a
letter ($y_m^* \in \Sigma$) or a fragment ($y_m^* \in \Sigma^*$), since the possibility of $y_m^* = e$ was
explicitly excluded. If $y_m^* \in \Sigma$, let $u = e$ and $u' = y_m^*$ so that $D(x_m^*, y_m^*) =
D(x_m^*, u')$. On the other hand, if $y_m^* \notin \Sigma$, then by Lemma 3.7, $x_m^* = e$. Let
$u, u' \in \Sigma^*$ be fragments of $y_m^*$ such that $y_m^* = uu'$ and $u'_1 = y'_1$ (i.e. we split
$y_m^*$ into a part not overlapping with $y'$ and a part overlapping with it). It
is possible that $u = e$ but we always have $u' \in \Sigma^+$ by construction. By our
assumption about increasing gap penalty, it follows that

\[ D(x_m^*, y_m^*) = D(e, uu') = \alpha(uu') \ge \alpha(u') = D(x_m^*, u'). \tag{13} \]

In a similar way, the fragment $y_n^*$ can be expressed as $y_n^* = v'v$ where
$v'_{|v'|} = y'_{|y'|}$ (i.e. $v'$ contains the end of $y'$) and

\[ D(x_n^*, y_n^*) = D(e, v'v) = \alpha(v'v) \ge \alpha(v') = D(x_n^*, v'). \tag{14} \]

Now, let $\tilde{x} = x_1^* \cdots x_{m-1}^*$, $x' = x_m^* \cdots x_n^*$ and $\hat{x} = x_{n+1}^* \cdots x_K^*$. Let $\tilde{y} =
y_1^* \cdots y_{m-1}^*$ and $\hat{y} = y_{n+1}^* \cdots y_K^*$. Then, $x = \tilde{x} x' \hat{x}$, $y = \tilde{y} u y' v \hat{y}$ and

\[
\begin{aligned}
D(x, y) &= \left( \sum_{k=1}^{K} D^p(x_k^*, y_k^*) \right)^{1/p} \\
&= \left( D^p(\tilde{x}, \tilde{y}) + D^p(x_m^*, uu') + \sum_{k=m+1}^{n-1} D^p(x_k^*, y_k^*) + D^p(x_n^*, v'v) + D^p(\hat{x}, \hat{y}) \right)^{1/p} \\
&\ge \left( D^p(\tilde{x}, \tilde{y}) + D^p(x_m^*, u') + \sum_{k=m+1}^{n-1} D^p(x_k^*, y_k^*) + D^p(x_n^*, v') + D^p(\hat{x}, \hat{y}) \right)^{1/p} \\
&\ge \left( D^p(\tilde{x}, \tilde{y}) + D^p(x', y') + D^p(\hat{x}, \hat{y}) \right)^{1/p},
\end{aligned}
\]

since $((x_m^*, u'), (x_{m+1}^*, y_{m+1}^*), \ldots, (x_{n-1}^*, y_{n-1}^*), (x_n^*, v'))$ is an alignment of $x'$ and $y'$
and hence the sum of the $p$-th powers of the distances over it is at least $D^p(x', y')$ by Lemma
3.7. □

Therefore, any $\ell^p$ edit distance with composition-length gap penalties is
arbitrarily decomposable of order $p$. However, there exist arbitrarily
decomposable distances that are not edit distances.

Example 3.18. Let $\Sigma$ be a finite set and let $d$ be a metric on $\Sigma$. For any
$n \in \mathbb{N}$, the generalized Hamming distance $d_n$ on $\Sigma^n$ is given for all $x, y \in \Sigma^n$
by

\[ d_n(x, y) = \sum_{i=1}^{n} d(x_i, y_i). \tag{15} \]

It can be easily shown that $d_n$ is a metric. The generalized Hamming distance
is a natural generalization of the Hamming distance [26] where the distance
$d$ on $\Sigma$ is the discrete metric.

Let $f : \Sigma \to \mathbb{R}$ be a function such that for all $a, b \in \Sigma$,

\[ |f(a) - f(b)| \le d(a, b) \le f(a) + f(b). \tag{16} \]

It immediately follows that for every $n \in \mathbb{N}$ and for all $x, y \in \Sigma^n$,

\[ |f(x) - f(y)| \le d_n(x, y) \le f(x) + f(y). \tag{17} \]

Define the distance $\rho : \Sigma^* \times \Sigma^* \to \mathbb{R}$ by extending $d_n$ and $f$ so that for all
$x, y \in \Sigma^*$,

\[
\rho(x, y) = \begin{cases}
d_n(x, y) & \text{if } |x| = |y| = n, \\
f(x) + f(y) & \text{if } |x| \ne |y|.
\end{cases} \tag{18}
\]

Using (17), it is easy to show that $\rho$ is a metric on $\Sigma^*$. Furthermore,
$\rho$ is arbitrarily decomposable (of order 1). Indeed, consider $x, y \in \Sigma^*$ and
$y' \in \mathcal{F}(y)$. If $|x| = |y|$, one immediately obtains the required decomposition
using the form of the generalized Hamming distance. On the other hand, if
$|x| \ne |y|$, we have

\[ \rho(x, y) = f(x) + f(y) \ge \rho(x, e) + \rho(e, y') \tag{19} \]

leading to the decomposition where $x' = e$ and $u$ and $v$ take all of $y$ apart
from $y'$.

The metric $\rho$ (generalized to $\ell^p$ form) can be interpreted as an 'ungapped'
version of $\ell^p$ edit distances. Here substitutions are allowed only between
sequences of equal length and the function $f$ plays a role of gap penalty so
that the only way to transform sequences of unequal length is through a full
deletion followed by insertion.
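The construction in (15)-(18) can be tried out on a small alphabet. The sketch below is an illustration under assumptions: the names `make_rho`, `discrete`, `unit_weight` and `words` are hypothetical, and the choice of the discrete metric with $f(a) = 1$ for every letter is just one pair satisfying condition (16).

```python
from itertools import product


def hamming(x, y, d):
    """Generalized Hamming distance (15) between words of equal length."""
    assert len(x) == len(y)
    return sum(d(a, b) for a, b in zip(x, y))


def make_rho(d, f):
    """'Ungapped' distance (18) built from a letter metric d and a weight f."""
    def f_ext(x):
        # Canonical homomorphic extension of f to words.
        return sum(f(a) for a in x)

    def rho(x, y):
        if len(x) == len(y):
            return hamming(x, y, d)
        return f_ext(x) + f_ext(y)
    return rho


def discrete(a, b):
    return 0 if a == b else 1


def unit_weight(a):
    return 1


def words(alphabet, max_len):
    """All words over the alphabet of length at most max_len."""
    return ["".join(t) for n in range(max_len + 1)
            for t in product(alphabet, repeat=n)]
```

For example, `make_rho(discrete, unit_weight)` gives distance 1 between `"ACA"` and `"ACC"` but distance 5 between `"AC"` and `"ACC"`, since words of unequal length can only be related through full deletion followed by insertion.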



4 Global Similarity 

A more common approach to sequence comparison is to maximize similarities
instead of minimizing distances. In this approach, a similarity measure on $\Sigma$
and gap penalties are used to define the similarity between two sequences
in $\Sigma^*$ using the Needleman-Wunsch [55] or Smith-Waterman [69] dynamic
programming algorithm, which are very similar to the algorithm for
computation of edit distances described above. As in the case of $\ell^p$ edit distances
above, we define sequence similarities using a recursive definition.
Definition 4.1. Let $\Sigma$ be a set, $s : \Sigma \times \Sigma \to \mathbb{R}$, and let $\gamma, \delta \in \Gamma(\Sigma)$. For any
$x, y \in \Sigma^*$ where $m = |x|$ and $n = |y|$, define the global (Needleman-Wunsch)
similarity $S(x, y)$ using the following recursion:

(a) $S(\bar{x}_0, \bar{y}_0) = S(e, e) = 0$,

(b) $S(e, \bar{y}_j) = -\gamma(\bar{y}_j)$ for all $1 \le j \le n$,

(c) $S(\bar{x}_i, e) = -\delta(\bar{x}_i)$ for all $1 \le i \le m$, and

(d) for all $1 \le i \le m$ and $1 \le j \le n$,

\[
S(\bar{x}_i, \bar{y}_j) = \max \left\{
\begin{array}{l}
S(\bar{x}_{i-1}, \bar{y}_{j-1}) + s(x_i, y_j), \\
\max_{1 \le k \le j} \left\{ S(\bar{x}_i, \bar{y}_{j-k}) - \gamma(y_{j-k+1} \cdots y_j) \right\}, \\
\max_{1 \le k \le i} \left\{ S(\bar{x}_{i-k}, \bar{y}_j) - \delta(x_{i-k+1} \cdots x_i) \right\}
\end{array}
\right\}. \tag{20}
\]

The global similarity between the sequences $x$ and $y$ (extending $s$, $\gamma$ and $\delta$)
is defined by $S(x, y) = S(\bar{x}_m, \bar{y}_n)$.

The algorithm used to compute edit distance (Remark 3.5) can also
be used for computation of similarities by setting $d = -s$, $\alpha = \gamma$ and $\beta = \delta$,
computing $D$ for $p = 1$ and then taking $S = -D$. The running time of the
dynamic programming algorithm depends on the properties of gap penalties,
as discussed in the previous section. Note that the gap penalty functions
are positive in the case of both distances and similarities, being added in the
former case and subtracted in the latter. It is also possible to express global
similarity as a sum of similarities over alignments, as is done for edit distance
in Lemma 3.7.

Example 4.2. It is well known [25] that the longest common subsequence
problem described in Example 3.15 can be approached using similarities
rather than distances. Let $\Sigma$ be a set and let $s$ be a scoring function on
$\Sigma$ such that $s(a, b) = 0$ if $a \ne b$ and $s(a, a) = 1$. Let $\gamma(x) = \delta(x) = 0$ for all
$x \in \Sigma^+$. It is easy to confirm that for $x, y \in \Sigma^*$, $S(x, y) = |LCS(x, y)|$.

Relations between global similarities and $\ell^1$ edit distances were explored
early on [TDIEB].
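The recursion of Definition 4.1 and the observation of Example 4.2 can be reproduced with a short dynamic program. This is a minimal sketch with a hypothetical function name `global_similarity`; `gamma` prices insertions and `delta` deletions, each taking a whole fragment, and no attempt is made at the faster affine-gap variant.

```python
def global_similarity(x, y, s, gamma, delta):
    """Global (Needleman-Wunsch type) similarity via recursion (20)."""
    m, n = len(x), len(y)
    S = [[0.0] * (n + 1) for _ in range(m + 1)]
    for j in range(1, n + 1):
        S[0][j] = -gamma(y[:j])
    for i in range(1, m + 1):
        S[i][0] = -delta(x[:i])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            best = S[i - 1][j - 1] + s(x[i - 1], y[j - 1])
            for k in range(1, j + 1):      # gap absorbing the last k letters of y
                best = max(best, S[i][j - k] - gamma(y[j - k:j]))
            for k in range(1, i + 1):      # gap absorbing the last k letters of x
                best = max(best, S[i - k][j] - delta(x[i - k:i]))
            S[i][j] = best
    return S[m][n]
```

With the match score `1 if a == b else 0` and zero gap penalties, the returned value is the length of a longest common subsequence, as stated in Example 4.2.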
Theorem 4.3 ([70], [55]). Let $S$ be the global similarity with respect to $s$, $\gamma$
and $\delta$ such that for all $x \in \Sigma^+$, $\gamma(x) = \delta(x) = \psi(|x|)$, where $\psi$ is a positive
function. Consider the $\ell^1$ edit distance $D$, extending $d : \Sigma \times \Sigma \to \mathbb{R}$ and the
gap penalties $\alpha$ and $\beta$, and let $s_M = \max\{s(a, b) \mid a, b \in \Sigma\}$. Suppose for all
$a, b \in \Sigma$,

\[ d(a, b) = s_M - s(a, b), \tag{21} \]

and for all $x \in \Sigma^+$,

\[ \alpha(x) = \beta(x) = \frac{s_M |x|}{2} + \psi(|x|). \tag{22} \]

Then, $S$ and $D$ will induce equivalent sets of optimal alignments and for all
$x, y \in \Sigma^*$,

\[ D(x, y) = s_M \frac{|x| + |y|}{2} - S(x, y). \tag{23} \]

□

The distance function obtained by taking a constant minus similarity is
not guaranteed to satisfy any of the axioms for a metric or a quasi-metric:
one problem is that the self-similarity $S(x, x)$ for any $x \in \Sigma^*$ is not
necessarily a constant. However, under some more restrictive but frequently valid
assumptions, it is possible to transform similarities into metrics or
quasi-metrics. We establish the results that have interesting biological
interpretations and provide the foundation for considering transformation of local
similarities, discussed in Section 5, to quasi-metrics.

Definition 4.4. Let $X$ be a set and let $s$ be a (similarity) map $X \times X \to \mathbb{R}$.
We call $s$ a sane scoring function if for all $x, y \in X$,

(i) $s(x, x) \ge 0$,

(ii) $s(x, x) \ge s(x, y)$, and

(iii) $s(x, x) \ge s(y, x)$.

Thus, a similarity map is sane if every element of $X$ 'keeps its identity'
with respect to it. Every point is similar to itself and this similarity cannot
be smaller than similarity to any other point.
Proposition 4.5. Let $\Sigma$ be a set and let $s : \Sigma \times \Sigma \to \mathbb{R}$ be a sane scoring
function over $\Sigma$. Suppose $\gamma, \delta \in \Gamma(\Sigma)$ and $S$ is the global similarity on $\Sigma^*$ with
respect to $s$, $\gamma$ and $\delta$. Then, $S$ is a sane scoring function and for all $x \in \Sigma^*$,

\[ S(x, x) = \sum_{i=1}^{|x|} s(x_i, x_i). \tag{24} \]

□

Proposition 4.5 and Theorem 3.10 give us a straightforward way to convert
global similarities to (quasi-) metrics. Since this transformation is based
on the transformations of similarity scores to distances on generators, we first
introduce additional nomenclature.

Definition 4.6. Let $\Sigma$ be a set and let $1 \le p < \infty$. For a sane scoring
function $s$ on $\Sigma$, we will use $\mathrm{AQ}^p(s)$ to denote the distance $q$ on $\Sigma$ given by

$$q(a, b) = \big( s(a, a) - s(a, b) \big)^{1/p} \tag{25}$$

and $\mathrm{AM}^p(s)$ to denote the distance $d$ on $\Sigma$ given by

$$d(a, b) = \big( s(a, a) + s(b, b) - s(a, b) - s(b, a) \big)^{1/p}. \tag{26}$$

Note that at this stage we do not make an assumption that $\mathrm{AQ}^p(s)$ is a
quasi-metric nor that $\mathrm{AM}^p(s)$ is a metric.
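
Equations (25) and (26), together with a triangle-inequality check, can be sketched as follows. The toy three-letter score matrix is invented purely for illustration, and counting ordered triples is an assumption: the counting convention used for the tables later in the text is not stated here.

```python
from itertools import product


def aq(s, p):
    """AQ^p(s): the distance of Equation (25)."""
    return lambda a, b: (s(a, a) - s(a, b)) ** (1.0 / p)


def am(s, p):
    """AM^p(s): the distance of Equation (26)."""
    return lambda a, b: (s(a, a) + s(b, b) - s(a, b) - s(b, a)) ** (1.0 / p)


def triangle_violations(dist, letters, tol=1e-12):
    """Count ordered triples (a, b, c) with dist(a, c) > dist(a, b) + dist(b, c)."""
    return sum(
        1
        for a, b, c in product(letters, repeat=3)
        if dist(a, c) > dist(a, b) + dist(b, c) + tol
    )


# Hypothetical symmetric score matrix on the letters A, B, C.
TOY = {("A", "A"): 4, ("B", "B"): 4, ("C", "C"): 4,
       ("A", "B"): 3, ("B", "A"): 3,
       ("B", "C"): 3, ("C", "B"): 3,
       ("A", "C"): 0, ("C", "A"): 0}
toy_score = lambda a, b: TOY[(a, b)]
```

On this toy matrix the distance $\mathrm{AQ}^1$ fails the triangle inequality for the path A, B, C (distances 1, 1 and 4), whereas $\mathrm{AQ}^2$ satisfies it (distances 1, 1 and 2), which illustrates why larger $p$ admits more score matrices.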

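Corollary 4.7. Let $\Sigma$ be a set and let $1 \le p < \infty$. Suppose $s$ is a sane
scoring function on $\Sigma$, $d = \mathrm{AQ}^p(s)$ is a quasi-metric on $\Sigma$ and $\gamma, \delta \in \Gamma_{CL}(\Sigma)$
are such that

$$\gamma(b) - \gamma(a) \le d^p(a, b) \tag{27}$$

and

$$s(a, a) + \delta(a) - s(b, b) - \delta(b) \le d^p(a, b). \tag{28}$$

Let $S$ be the global similarity with respect to $s$, $\gamma$ and $\delta$ and let $\alpha(x) = \gamma(x)^{1/p}$
and $\beta(x) = \big( S(x, x) + \delta(x) \big)^{1/p}$ for all $x \in \Sigma^+$. Then, the $\ell^p$ edit distance
$D = \mathrm{EQ}^p(d, \alpha, \beta)$ is given for all $x, y \in \Sigma^*$ by the formula

$$D(x, y) = \big( S(x, x) - S(x, y) \big)^{1/p}. \tag{29}$$

□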
As with edit distances, we now introduce a nomenclature for quasi-metrics
and metrics obtained from similarities.

Definition 4.8. Let $\Sigma$ be a set and let $1 \le p < \infty$. Suppose $D$ is an $\ell^p$ edit
distance obtained from a global similarity $S$ on $\Sigma^*$ using the formula (29) of
Corollary 4.7, where $S$ extends $s : \Sigma \times \Sigma \to \mathbb{R}$ and $\gamma, \delta \in \Gamma_{CL}(\Sigma)$. We will write
$D = \mathrm{GQ}^p(s, \gamma, \delta)$ if $D$ is a quasi-metric and $D = \mathrm{GM}^p(s, \gamma, \delta)$ if $D$ is a metric.

The above nomenclature is redundant, in that every distance derived from
similarities using Corollary 4.7 can be expressed using the nomenclatures for
edit distances and distances on $\Sigma$ introduced in Definition 4.6. We have
nevertheless chosen to introduce the additional notation in order to emphasize
that the distances on the free monoid are derived from similarities and also
because the computation of distances can be performed using algorithms for
similarities. This notation will also be convenient in the following sections,
where local similarities are discussed.

Example 4.9. Let $\Sigma$ be a set and suppose $s$ is a sane symmetric function
$\Sigma \times \Sigma \to \mathbb{R}$ and $\gamma \in \Gamma_{CL}(\Sigma)$ depends only on length. This is a very frequent
setup in pairwise comparison of DNA and protein sequences (see Section
6 below for a more detailed discussion). Define for all $a, b \in \Sigma$, $s'(a, b) =
2 s(a, b) - s(b, b)$ and for all $x \in \Sigma^+$, $\gamma'(x) = 2\gamma(x) + \sum_i s(x_i, x_i)$ and $\delta'(x) =
2\gamma(x)$.

Suppose that the distance $d = \mathrm{AQ}^p(s') = \mathrm{AM}^p(s)$ is a metric on $\Sigma$. Since
$s$ is sane, $s'$ is also sane and we have

$$|s'(a, a) - s'(b, b)| = |s(a, a) - s(b, b)| \le d^p(a, b).$$

Therefore, since $\gamma$ depends solely on length, the requirements (27) and (28)
of Corollary 4.7 are satisfied. Let $S$ be the global similarity extending $s$, $\gamma$
and $\gamma$ and let $S'$ be the global similarity extending $s'$, $\gamma'$ and $\delta'$. We conclude
that the distance $D$ given by

$$D(x, y) = \big( S'(x, x) - S'(x, y) \big)^{1/p} = \big( S(x, x) + S(y, y) - 2 S(x, y) \big)^{1/p} \tag{30}$$

is the metric $\mathrm{GM}^p(s', \gamma', \delta')$. This metric can also be expressed as $\mathrm{EM}^p(\mathrm{AM}^p(s), \alpha)$,
where $\alpha(x) = \big( S(x, x) + \gamma(x) \big)^{1/p}$ for all $x \in \Sigma^+$.
5 Local Similarity

Local similarity is computed using the Smith-Waterman algorithm [69].

Definition 5.1. Let $\Sigma$ be a set, $s : \Sigma \times \Sigma \to \mathbb{R}$, and let $\gamma, \delta \in \Gamma(\Sigma)$. Let
$x, y \in \Sigma^*$, $m = |x|$ and $n = |y|$. The Smith-Waterman dynamic programming
matrix, denoted $\mathrm{SW}(x, y, s, \gamma, \delta)$, is an $(m + 1) \times (n + 1)$ matrix $H$ with rows
and columns indexed from 0 such that $H_{0,0} = 0$ and for all $1 \le i \le m$ and
$1 \le j \le n$, $H_{i,0} = 0$, $H_{0,j} = 0$ and

$$H_{i,j} = \max \Big\{ H_{i-1,j-1} + s(x_i, y_j),\ \max_{1 \le k \le i} \{ H_{i-k,j} - \delta(x_{i-k+1} \cdots x_i) \},\ \max_{1 \le k \le j} \{ H_{i,j-k} - \gamma(y_{j-k+1} \cdots y_j) \},\ 0 \Big\}.$$

The local similarity between the sequences $x$ and $y$ (given $s$, $\gamma$ and $\delta$),
denoted $H(x, y)$, is defined to be the largest entry of $H$, that is, $H(x, y) =
\max_{i,j} H_{i,j}$.

Local similarity between two words can be realized as global similarity of 
their fragments. 

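Theorem 5.2 ([68]). Let $\Sigma$ be a set, $s : \Sigma \times \Sigma \to \mathbb{R}$ and $\gamma, \delta \in \Gamma(\Sigma)$. Suppose
$S$ is a global similarity extending $s$, $\gamma$ and $\delta$ and $H$ is the local similarity with
respect to $s$, $\gamma$ and $\delta$. Then, for all $x, y \in \Sigma^*$,

$$H(x, y) = \max_{x' \in \mathcal{F}(x),\ y' \in \mathcal{F}(y)} S(x', y'). \tag{31}$$

□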
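
The statement of Theorem 5.2 can be checked by brute force on short words. The sketch below assumes length-dependent gap penalties (the same function `gap` for both sequences) and uses illustrative scores; the Smith-Waterman recursion follows Definition 5.1, and the global recursion is the same without the zero floor.

```python
def global_similarity(x, y, s, gap):
    """Best score over global alignments; gap(k) penalises a gap of length k."""
    m, n = len(x), len(y)
    S = [[float("-inf")] * (n + 1) for _ in range(m + 1)]
    S[0][0] = 0
    for i in range(m + 1):
        for j in range(n + 1):
            cands = []
            if i and j:
                cands.append(S[i - 1][j - 1] + s(x[i - 1], y[j - 1]))
            cands += [S[i - k][j] - gap(k) for k in range(1, i + 1)]
            cands += [S[i][j - k] - gap(k) for k in range(1, j + 1)]
            if cands:
                S[i][j] = max(cands)
    return S[m][n]


def local_similarity(x, y, s, gap):
    """Largest entry of the Smith-Waterman matrix of Definition 5.1."""
    m, n = len(x), len(y)
    H = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cands = [0, H[i - 1][j - 1] + s(x[i - 1], y[j - 1])]
            cands += [H[i - k][j] - gap(k) for k in range(1, i + 1)]
            cands += [H[i][j - k] - gap(k) for k in range(1, j + 1)]
            H[i][j] = max(cands)
    return max(max(row) for row in H)


def factors(w):
    """All contiguous factors of w, including the empty word."""
    return {w[i:j] for i in range(len(w) + 1) for j in range(i, len(w) + 1)}


def local_by_factors(x, y, s, gap):
    """Right-hand side of Equation (31), evaluated exhaustively."""
    return max(global_similarity(u, v, s, gap)
               for u in factors(x) for v in factors(y))


score = lambda a, b: 2 if a == b else -1  # illustrative scores
affine = lambda k: 2 + k                  # illustrative gap penalty
```

The exhaustive version is only practical for very short words, since it runs one global alignment per pair of factors, whereas the dynamic programming matrix yields the same value in a single pass.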

Although conversion of global similarities to distances outlined in Section
4 is relatively straightforward, its counterpart for local similarity is much less
so. We now use the results from the previous sections to state our main result:
construction of quasi-metrics which include conversions of local similarities.

Theorem 5.3. Let $\Sigma$ be a set and let $1 \le p < \infty$. Let $\rho$ be a separating
quasi-metric on $\Sigma^*$ that is arbitrarily decomposable of order $p$. Suppose $f$ is
a strictly positive and $g$ is a non-negative function $\Sigma \to \mathbb{R}$ and $\bar{f}$ and $\bar{g}$ are
the canonical homomorphic extensions of $f$ and $g$, respectively, to the free
monoid $\Sigma^*$. Assume also that for all $x, y \in \Sigma^*$,

$$\bar{f}(x) - \bar{f}(y) \le \rho^p(x, y) \quad \text{and} \quad \bar{g}(y) - \bar{g}(x) \le \rho^p(x, y). \tag{32}$$

Then, the function $Q : \Sigma^* \times \Sigma^* \to \mathbb{R}$ defined by

$$Q(x, y) = \min_{\tilde{x} \in \mathcal{F}(x),\ \tilde{y} \in \mathcal{F}(y)} \big( \bar{f}(x) - \bar{f}(\tilde{x}) + \bar{g}(y) - \bar{g}(\tilde{y}) + \rho^p(\tilde{x}, \tilde{y}) \big)^{1/p} \tag{33}$$

is a quasi-metric on $\Sigma^*$.

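Proof. Let $x, y, z \in \Sigma^*$. Since $\bar{f}(x) \ge \bar{f}(\tilde{x})$ and $\bar{g}(y) \ge \bar{g}(\tilde{y})$ for any $\tilde{x} \in \mathcal{F}(x)$,
$\tilde{y} \in \mathcal{F}(y)$ and since $\rho$ is a quasi-metric and hence non-negative, it follows that
$Q(x, y) \ge 0$. Furthermore, it is clear that $Q(x, x) = 0$.

Suppose that $Q(x, y) = 0$. Then, there exist $\tilde{x} \in \mathcal{F}(x)$ and $\tilde{y} \in \mathcal{F}(y)$
such that $\bar{f}(x) - \bar{f}(\tilde{x}) + \bar{g}(y) - \bar{g}(\tilde{y}) + \rho^p(\tilde{x}, \tilde{y}) = 0$. Since $\rho(\tilde{x}, \tilde{y}) \ge 0$,
$\bar{f}(x) - \bar{f}(\tilde{x}) \ge 0$ and $\bar{g}(y) - \bar{g}(\tilde{y}) \ge 0$ for any $x, y \in \Sigma^*$, it follows that
$\bar{f}(x) = \bar{f}(\tilde{x})$, $\bar{g}(y) = \bar{g}(\tilde{y})$ and $\rho(\tilde{x}, \tilde{y}) = 0$. The first statement implies
that $\tilde{x} = x$ since $f$ is a strictly positive function, while the last means that
$\tilde{x} = \tilde{y}$ (since $\rho$ is a separating quasi-metric). Therefore, $Q(x, y) = 0$ implies
$x \in \mathcal{F}(y)$. Hence, $Q(x, y) = Q(y, x) = 0$ implies $x \in \mathcal{F}(y)$ and $y \in \mathcal{F}(x)$ and
thus $x = y$.

To establish the triangle inequality suppose that

$$Q(x, y) = \big( \bar{f}(x) - \bar{f}(\tilde{x}) + \bar{g}(y) - \bar{g}(\tilde{y}) + \rho^p(\tilde{x}, \tilde{y}) \big)^{1/p} \tag{34}$$

for some $\tilde{x} \in \mathcal{F}(x)$, $\tilde{y} \in \mathcal{F}(y)$ and

$$Q(y, z) = \big( \bar{f}(y) - \bar{f}(\hat{y}) + \bar{g}(z) - \bar{g}(\hat{z}) + \rho^p(\hat{y}, \hat{z}) \big)^{1/p} \tag{35}$$

for some $\hat{y} \in \mathcal{F}(y)$ and $\hat{z} \in \mathcal{F}(z)$. Write out $\tilde{y} = y_i y_{i+1} \cdots y_{i+m-1}$, $\hat{y} =
y_j y_{j+1} \cdots y_{j+n-1}$ where $m = |\tilde{y}|$, $n = |\hat{y}|$, $1 \le i \le i + m - 1 \le |y|$ and
$1 \le j \le j + n - 1 \le |y|$. If $\tilde{y}$ and $\hat{y}$ overlap, that is, if $i \le j \le i + m - 1$ or
$j \le i \le j + n - 1$, let $y'$ denote the whole overlapping fragment (for example, if
$i \le j \le i + m - 1 \le j + n - 1$, $y' = y_j y_{j+1} \cdots y_{i+m-1}$; see Figure 2). If $\tilde{y}$ and
$\hat{y}$ do not overlap or either $\tilde{y}$ or $\hat{y}$ is the identity, let $y' = e$.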
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Figure 2: Decomposition of x, y and z. In this pattern of overlap of y and 
y, we have = y2 = yl = zl = e. 

Since p is arbitrarily decomposable of order p, there exist x', £*, G 5^(5) 
such that X — x\x'x2 and yl, ^3, u,v & ^{y) such that y = y{uy'vy2 and 

KxJ) > (/(xt,yr) + /(x',y')+/(^;j2*))'^'- (36) 

Furthermore, by the same assumption, there exist z',zl,Z2 G ^(i) such that 
z — z{z'z2 and yjf , 't' G S^(y) such that y — yliiy'vyl and 

p(y, i) > (/(yr, i*) + p^(y', z') + p^(y;, i2*)) ■ (37) 
Therefore, using the Minkowski inequality,

Q{x, y) + Q{y, z) > (f{x) - f{x) + g{y) - g{y) 



i/p 

+ pP(xl,yl) + ffix^y') + ff{x;,y;)' 



+ {fiy)-fiy) + -9iz)-m 

i/p 



+ ff(ylzl)+ffiy',z')+(fir2,Z;)^ 

> { fix) - fix) + fiy) - m + ffiil,y*i) + ffixl yl) 

+ giy) - giy) + giz) - giz) + zi) + (f{yi, i*) 



^{p{x\y') + piy\z')y'"' 



Since $\bar{f}$ and $\bar{g}$ are additive functions that satisfy the inequality (32) and
since $y'$ is the full extent of the overlap between $\tilde{y}$ and $\hat{y}$, we have

fix) - fix) + fiy) - fiy) + fPiil,yl) + f^i^l y^) 
> fix) - fix) + fiy) - fiy) + fixl) - M) + m) - m) 



and 



> fix)- fix') >0, 



giy) - giy) + giz) - giz) + ffiyl, zl) + ffiyl, i*) 

> giy) - giy) + giz) - giz) - giyl) + gizl) - giyl) + giz. 

> -g{z)--g{z')>Q. 



Hence, by the triangle inequality for $\rho$,

$$Q(x, y) + Q(y, z) \ge \big( \bar{f}(x) - \bar{f}(x') + \bar{g}(z) - \bar{g}(z') + \rho^p(x', z') \big)^{1/p} \ge Q(x, z),$$

as required. □

Remark 5.4. We have shown in the separation part of the proof of Theorem
5.3 above that $Q(x, y) = 0 \implies x \in \mathcal{F}(y)$ and hence the associated partial
order of the quasi-metric $Q$ is $x \le_Q y \iff x \in \mathcal{F}(y)$. If $g$ is a strictly
positive function, $Q$ is a separating quasi-metric and the partial order is
trivial: $Q(x, y) = 0$ implies $x = y$ and hence each point is only comparable
to itself.

However, if $g$ is zero everywhere, then $Q(x, y) = 0$ and $x \neq y$ implies
that $x$ is a factor of $y$ while $y$ is not a factor of $x$, so that $x$ and $y$ are
non-trivially comparable. In this case, the quasi-metric $Q$ is not separating
and it generalizes the substring partial order: for every $x, y \in \Sigma^*$ such that
$x \in \mathcal{F}(y)$, we have $Q(x, y) = 0$. Therefore, $Q(x, y)$ can be interpreted as
measuring how far $x$ is from being a factor of $y$.

Since the identity $e$ is a trivial factor of every word and $\bar{f}(e) = 0$, it
follows that (in the case of $g = 0$) $Q(e, x) = 0$ for every $x \in \Sigma^+$, in contrast to
$\rho(e, x) > 0$. On the other hand, it can be easily seen that $Q(x, e) = (\bar{f}(x))^{1/p}$
and hence for all $y \in \Sigma^+$, $Q(x, y) \le (\bar{f}(x))^{1/p} = Q(x, e)$.

We now introduce a nomenclature for quasi-metrics and their associated
metrics defined in Theorem 5.3.

Definition 5.5. Let $\Sigma$ be a set and let $1 \le p < \infty$. Suppose $\rho$ is a separating
quasi-metric on $\Sigma^*$ and $f$ and $g$ are functions $\Sigma \to \mathbb{R}$ that satisfy all the
requirements of Theorem 5.3 with respect to $\rho$ and $p$. Let $Q$ be the
quasi-metric obtained using the formula (33) of Theorem 5.3. We will write $Q =
\mathrm{LQ}^p(\rho, f, g)$ if $Q$ is a quasi-metric and $Q = \mathrm{LM}^p(\rho, f, g)$ if $Q$ is a metric.

Remark 5.6. Edit distances described in Section 3 are always global: they
measure the full cost of transformation between two words in $\Sigma^*$. Indeed, a
truly 'local' distance, that is, a distance measured on factors of the words being
compared, would not satisfy the triangle inequality.

The $\mathrm{LQ}^p$ distances are slightly different. The distance $\rho$ contributes to
$Q$ by evaluating the pair of factors $\tilde{x}$ and $\tilde{y}$ that are 'closest' to each other
(relative to $f$ and $g$), while $f$ and $g$ score the left-over pieces of $x$ and $y$,
respectively. The extent of $\tilde{x}$ and $\tilde{y}$ relative to $x$ and $y$ depends on the exact
choice of functions $f$ and $g$ and their relation to the distance $\rho$. For example,
when $f$ and $g$ are very large compared to $\rho$, the factors $\tilde{x}$ and $\tilde{y}$ will approach
the whole sequences $x$ and $y$. On the other hand, if $f$ and $g$ are small, they
will contribute most to $\mathrm{LQ}^p(\rho, f, g)$, depending on the exact properties of $\rho$.

When both $f$ and $g$ are strictly positive, the $\mathrm{LQ}^p$ distance has a global
character in that the whole of $x$ and $y$ are accounted for. If $g = 0$, only
$x$ contributes to the distance as a whole; the sequence $y$ contributes only
through its factor closest to a factor of $x$. In general, it is possible to favor $x$
or $y$ by appropriately choosing the values of $f$ and $g$.
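Theorem 5.3 can be applied to similarities in the following manner. Let
$Q = \mathrm{LQ}^p(\rho, f, g)$. Define a global similarity $\sigma$ on $\Sigma^*$ by

$$\sigma(x, y) = \bar{f}(x) + \bar{g}(y) - \rho^p(x, y). \tag{38}$$

Then,

$$Q(x, y) = \Big( \bar{f}(x) + \bar{g}(y) - \max_{\tilde{x} \in \mathcal{F}(x),\ \tilde{y} \in \mathcal{F}(y)} \big\{ \bar{f}(\tilde{x}) + \bar{g}(\tilde{y}) - \rho^p(\tilde{x}, \tilde{y}) \big\} \Big)^{1/p} = \Big( \bar{f}(x) + \bar{g}(y) - \max_{\tilde{x} \in \mathcal{F}(x),\ \tilde{y} \in \mathcal{F}(y)} \sigma(\tilde{x}, \tilde{y}) \Big)^{1/p}. \tag{39}$$

Hence, if $\sigma$ can be computed using the Needleman-Wunsch algorithm (that
is, if $\rho$ is an edit distance), then $Q$ can always be evaluated by using
the Smith-Waterman algorithm to compute the local similarity $H(x, y) =
\max\{ \sigma(\tilde{x}, \tilde{y}) \mid \tilde{x} \in \mathcal{F}(x), \tilde{y} \in \mathcal{F}(y) \}$ and then using Equation (39).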

Since the functions $f$ and $g$ as well as the quasi-metric $\rho$ are arbitrary,
the applicability of Theorem 5.3 to similarities is very wide. The following
examples are simple corollaries of Theorem 5.3 and the results in Sections 3
and 4 that have important uses in computational biology.

Example 5.7. Let S be a finite set and suppose s is a sane symmetric 
function S x S — M such that the distance d = AQ^(s), is a metric on E. 
Let p = min{s(a, 6) | a, 6 G S} and let /(a) = s{a,a) — p. It is clear from 
the definitions of / and d that |/(a) - f {b)\ < d{a, h) < f{a) + f{b). 

Let p be the arbitrarily decomposable metric extending the generalized 
Hamming distance based on d and / to S*, as in Example 13.181 and define 
(7 : S ^ M by a{a) = s{a,a). By Theorem 15.31 we can construct the dis- 
tance LQ^{p, g, g), which is in fact the metric LM^(p, g, g). The underlying 
similarity a, given by Equation fl39l) . is 



where s„(x, y) = Yl'i=i ^{^iy Vi)- In computational biology apphcations, p will 
be negative (there will be at least two points in S that are dissimilar) and 
hence the local similarity will always be realized by aligning the fragments 




(40) 
of the same length. Therefore, the local similarity based on $\sigma$ is gapless
similarity, which has considerable historical importance, since the first version of
the BLAST [Ij suite of tools for sequence database search based on local
similarities used a heuristic that computed gapless alignments. Gapless alignments
had the advantage that they could be computed faster and that the statistics of
similarity scores arising from them were well characterized [57].

In the following Examples 5.8, 5.9 and 5.10 we will assume that $s :
\Sigma \times \Sigma \to \mathbb{R}$ is a sane scoring function, $\gamma, \delta \in \Gamma_{CL}(\Sigma)$ only depend on length
and $S$ and $H$ are the global and local similarity with respect to $s$, $\gamma$ and $\delta$,
respectively. In addition, let $f(a) = s(a, a)$ for all $a \in \Sigma$.

Example 5.8. Suppose $\mathrm{AQ}^1(s)$ is a quasi-metric. By Corollary 4.7 the
distance $D$ on $\Sigma^*$ given by $D(x, y) = S(x, x) - S(x, y)$ is a quasi-metric
$\mathrm{GQ}^1(s, \gamma, \delta)$. Consider the distance $Q = \mathrm{LQ}^1(\mathrm{GQ}^1(s, \gamma, \delta), f, 0)$. It is easy to
see that $\bar{f}(x) = S(x, x) = H(x, x)$ and hence

$$Q(x, y) = S(x, x) - \max_{\tilde{x} \in \mathcal{F}(x),\ \tilde{y} \in \mathcal{F}(y)} S(\tilde{x}, \tilde{y}) = H(x, x) - H(x, y). \tag{41}$$

As remarked earlier, the partial order associated with $Q$ in this case is the
subfragment partial order. Furthermore, the triangle inequality for $Q$ is
equivalent to

$$H(x, y) + H(y, z) \le H(y, y) + H(x, z). \tag{42}$$

If $H$ is symmetric, that is, if $s$ is symmetric and $\gamma = \delta$, we have

$$Q(x, y) + H(y, y) = Q(y, x) + H(x, x), \tag{43}$$

and hence $Q$ is a co-weightable quasi-metric and $-H$ is a partial metric.
Note that in this case, the triangle inequality is exactly equivalent to
the triangle inequality for the symmetrization $M(x, y) = Q(x, y) + Q(y, x)$
(Example 5.10), that is, if $M$ is a metric then $Q$ is a quasi-metric.

The fact that Equation (41) gives a quasi-metric was first established in
[75]. Indeed, the two generate equivalent neighborhoods: for any $x \in \Sigma^*$, the
set of all points $y \in \Sigma^*$ such that $H(x, y) \ge k$ is equal to the set $\{y \in \Sigma^* :
Q(x, y) \le \varepsilon\}$ where $\varepsilon = S(x, x) - k$.
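
Equation (41) translates directly into code once a Smith-Waterman routine is available. The sketch below assumes a match/mismatch scoring function and a linear gap penalty (a special case of length-dependent penalties); the default parameter values are illustrative only.

```python
def local_similarity(x, y, match=2, mismatch=-1, gap=2):
    """Smith-Waterman score H(x, y) with a linear penalty of `gap` per gapped letter."""
    H = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    best = 0
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            sub = match if x[i - 1] == y[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + sub,
                          H[i - 1][j] - gap, H[i][j - 1] - gap)
            best = max(best, H[i][j])
    return best


def Q(x, y):
    """Equation (41): Q(x, y) = H(x, x) - H(x, y)."""
    return local_similarity(x, x) - local_similarity(x, y)


def triangle_holds(x, y, z):
    """Check Q(x, z) <= Q(x, y) + Q(y, z), which is equivalent to Equation (42)."""
    return Q(x, z) <= Q(x, y) + Q(y, z)
```

For instance, `Q("GAT", "AGATTA")` vanishes because the first word is a factor of the second, while `Q("AGATTA", "GAT")` is positive, which reflects the subfragment partial order and the asymmetry of the quasi-metric.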

Example 5.9. Recall the notation from Example 4.9, where $s$ is symmetric,
$\gamma = \delta$, $s'(a, b) = 2 s(a, b) - s(b, b)$, $\gamma'(x) = 2\gamma(x) + \sum_i s(x_i, x_i)$ and $\delta'(x) =
2\gamma(x)$. Let $S'$ and $H'$ be the global and local similarity with respect to $s'$, $\gamma'$ and
$\delta'$, respectively.

Suppose that $\mathrm{AQ}^p(s')$ is a quasi-metric (equivalently that $\mathrm{AM}^p(s)$ is a
metric) and consider the quasi-metric $Q' = \mathrm{LQ}^p(\mathrm{GM}^p(s', \gamma', \delta'), f, 0)$. By the
argument of Example 5.8,

$$Q'(x, y) = \big( H'(x, x) - H'(x, y) \big)^{1/p} = \big( S(x, x) - H'(x, y) \big)^{1/p}. \tag{44}$$

However, in this case the local similarity

$$H'(x, y) = \max_{\tilde{x}, \tilde{y}} S'(\tilde{x}, \tilde{y}) = \max_{\tilde{x}, \tilde{y}} \big( 2 S(\tilde{x}, \tilde{y}) - S(\tilde{y}, \tilde{y}) \big) \tag{45}$$

is clearly asymmetric. This similarity score has, to our knowledge, never been
previously used for sequence comparison, although it can be easily computed
using the Smith-Waterman algorithm (provided that the particular
implementation used allows composition-length gap penalties). It has the advantage
that it is still true that $H'$ is topologically equivalent to $Q'$ and that $Q'$
corresponds to the subfragment partial order.

The asymmetry of $H'$ may be exploited to favor the integrity of one
sequence over the other in biological sequence alignments. For example,
in cases where translated DNA sequences are compared to proteins, it is
desirable to emphasize the protein sequence, which is 'real' (experimentally
established), at the expense of the translated DNA sequences, which are only
hypothetical. We intend to evaluate the broad utility of using variants of $H'$
and $Q'$ for biological sequence comparisons in a subsequent publication.

Example 5.10. Making the same assumptions as in Example 5.9 above,
consider the metric $M = \mathrm{LM}^p(\mathrm{GM}^p(s', \gamma', \delta'), f, f)$. It is easy to see that $M$
is indeed a metric given by

$$M(x, y) = \big( H(x, x) + H(y, y) - 2 H(x, y) \big)^{1/p}. \tag{46}$$

Equation (46), for $p = 1$, was extensively considered in computer science
and computational biology. The LCS similarities (Examples 3.15 and 4.2)
are related to distances in this way. Linial et al. [47] proposed using $M$ as
a distance on sets of protein sequences but did not explicitly prove it was a
metric. Spiro and Macura [71] have given the conditions under which $M$ is
indeed a metric. Since $H$ is here assumed symmetric, this result is equivalent
to $\mathrm{LQ}^1(\mathrm{GQ}^1(s, \gamma, \delta), f, 0)$ being a quasi-metric (Example 5.8), established by
Stojmirovic [73] under slightly different assumptions. Itoh et al. derived
the same result as a corollary of a more general inequality for similarities that
relied on the finiteness of the generator alphabet. In a poster abstract |T6],
Fischer proposed the general form of Equation (46) with arbitrary $p$ as a way
to convert similarities to distances and stated without proof the conditions
for $M$ to be a metric.

For $p = 2$, the form of Equation (46) resembles the formula for the
canonical metric in inner-product vector spaces. In this case, $x$ and $y$ would be
vectors and $H$ would be a positive-definite bilinear form.

Theorem 5.3 can be applied in the context of free abelian monoids with no
change. We illustrate this with a very simple example, followed by a more
biologically relevant one.

Example 5.11. Let $\Sigma$ be the set of all prime numbers and let $\Sigma^*$ be the free
abelian monoid over $\Sigma$ under multiplication (i.e. the set of natural numbers
$\mathbb{N}$). Let $d_\Sigma$ be a discrete metric on $\Sigma$ (here we implicitly assume that $\Sigma$
includes 1) and let $f(a) = 1$ and $g(a) = 0$ for all $a \in \Sigma$. Let $\rho$ be the
Sellers-Graev metric extension of $d_\Sigma$ to $\mathbb{N}$. It is clear that $\rho(x, y)$ is just the number
of different prime factors between $x$ and $y$ (the non-matching prime factors
are matched to 1) and that it is arbitrarily decomposable. Hence, we can
apply Theorem 5.3 to obtain a quasi-metric $Q$, so that $Q(x, y)$ is the number
of prime factors of $x$ not in common with $y$. The global similarity $\sigma$ on $\mathbb{N}$ (here
equivalent to local similarity), given by $\sigma(x, y) = \bar{f}(x) - \rho(x, y)$, evaluates to
the number of common prime factors (excluding 1) between $x$ and $y$.
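
The quantities in Example 5.11 can be computed with multisets of prime factors. This is a sketch under one interpretive assumption: since elements of a free abelian monoid are multisets of generators, prime factors are counted with multiplicity. The function names are illustrative.

```python
from collections import Counter


def prime_factors(n):
    """Multiset of prime factors of n >= 1, with multiplicity."""
    factors = Counter()
    p = 2
    while p * p <= n:
        while n % p == 0:
            factors[p] += 1
            n //= p
        p += 1
    if n > 1:
        factors[n] += 1
    return factors


def Q(x, y):
    """Number of prime factors of x (with multiplicity) not shared with y."""
    return sum((prime_factors(x) - prime_factors(y)).values())


def common_prime_factors(x, y):
    """Number of prime factors (with multiplicity) shared by x and y."""
    return sum((prime_factors(x) & prime_factors(y)).values())
```

In this setting `Q(x, y)` is zero exactly when `x` divides `y`, so the partial order associated with the quasi-metric is divisibility, the commutative analogue of the factor order on words.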

Example 5.12. Let $\Sigma$ be a finite set and let $A(\Sigma^k)$ denote the free abelian
monoid generated by the set $\Sigma^k$ of all words of length exactly $k$ (we will call
$z \in \Sigma^k$ a $k$-tuple). Members of $A(\Sigma^k)$ are therefore multisets of $k$-tuples.
Now consider the same structure as in the previous example.

Let $d_{\Sigma^k}$ be a discrete metric on $\Sigma^k \cup \{e\}$ and let $f(a) = 1$ and $g(a) = 0$
for all $a \in \Sigma^k$. Let $\rho$ be the Sellers-Graev metric extension of $d_{\Sigma^k}$ to $A(\Sigma^k)$.
All requirements of Theorem 5.3 still apply. The value $Q(x, y)$ is the number
of $k$-tuples that are contained in $x$ but not in $y$ and the global (and local)
similarity $\sigma$ gives the number of $k$-tuples common to both $x$ and $y$.

The similarity $\sigma$ has been used in computational biology as a
computationally inexpensive approximation of global similarity between two
sequences [33 [12]. Each sequence is mapped to $A(\Sigma^k)$ by taking the multiset
of all of its (overlapping) $k$-tuples and the similarity $\sigma$ is used to approximate
the global similarity $S$.
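
The mapping of a sequence to its multiset of overlapping $k$-tuples, together with the resulting similarity and quasi-metric, can be sketched with `collections.Counter`. The function names are illustrative and not taken from the cited tools.

```python
from collections import Counter


def ktuples(seq, k):
    """Multiset of all overlapping k-tuples of seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def common_ktuples(x, y, k):
    """Similarity: number of k-tuples common to x and y (multiset intersection)."""
    return sum((ktuples(x, k) & ktuples(y, k)).values())


def Q(x, y, k):
    """Quasi-metric: number of k-tuples of x that are not matched in y."""
    return sum((ktuples(x, k) - ktuples(y, k)).values())
```

Both quantities take time linear in the sequence lengths for fixed $k$, which is what makes this similarity attractive as a cheap approximation of alignment-based scores.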
6 Scoring Functions on Generators

In the previous sections we have made no assumption on the set of generators
$\Sigma$ and all our results apply to arbitrary sets. However, as we noted before, the
principal objects motivating our results are sets of biological sequences and
profiles derived from them. The nucleotide and amino acid alphabets are finite
and therefore the scoring functions over them are given by score matrices. We
therefore proceed to discuss the similarity and distance measures on the sets of
nucleotides, amino acids and profiles and their applicability to our theory.

6.1 Nucleotide scoring matrices 

The nucleotide alphabet consists of only 4 letters (A, C, G, and T) and the 
score matrices most frequently used for database search depend on only two 
parameters, for scoring a match or a mismatch of two nucleotides. For exam- 
ple, the blastn program, a part of the BLAST [2] suite of tools for sequence 
database search based on local similarities, which searches a DNA database 
with a DNA sequence as a query, uses the scoring matrix of the form 



The above scoring function is obviously sane and the distance $d = \mathrm{AQ}^p(s)$ is
a discrete metric for any $1 \le p < \infty$. Therefore, all match/mismatch scoring
schemes satisfy the requirements of Theorem 3.10 and its corollaries.

More complex score matrices, where transitions (changes C↔T and A↔G)
have different scores than transversions (all other mutations), have been
proposed for improving the accuracy of database searches [72, 10]. It is easy
to show that the distance $\mathrm{AQ}^1(s)$ (and hence $\mathrm{AQ}^p(s)$ for all $p$) will still
satisfy the triangle inequality and hence be a metric if the value of the distance
associated with a transition is not greater than twice the transversion distance.
Since the likelihood and hence the similarity score of a transition is larger than
that of a transversion, this condition is very likely to be satisfied in practice.
For example, all scoring matrices examined by States et al. [72] satisfy this
condition and are sane.
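
The transition/transversion condition can be verified exhaustively over the four nucleotides. The sketch below uses hypothetical score values (they are not taken from any published matrix) and checks the triangle inequality for the distance $\mathrm{AQ}^1(s)$.

```python
from itertools import product

PURINES = {"A", "G"}  # C and T are the pyrimidines


def ts_tv_score(a, b, match=1, transition=-1, transversion=-2):
    """Score with separate values for transitions (A<->G, C<->T) and transversions."""
    if a == b:
        return match
    same_class = (a in PURINES) == (b in PURINES)
    return transition if same_class else transversion


def aq1_satisfies_triangle(score, letters="ACGT"):
    """Check the triangle inequality for d(a, b) = s(a, a) - s(a, b)."""
    d = lambda a, b: score(a, a) - score(a, b)
    return all(d(a, c) <= d(a, b) + d(b, c)
               for a, b, c in product(letters, repeat=3))
```

The only non-trivial constraint arises because two transversions can compose to a transition (for example A to C to G), so the transition distance must not exceed twice the transversion distance; a scheme that penalises transitions much more heavily than transversions violates it.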




(47) 






6.2 Amino acid scoring matrices 

Unlike the nucleotide alphabet, the standard amino acid alphabet consists 
of 20 amino acids of markedly different chemical properties and structural 
roles. Hence, the regularly used amino acid scoring matrices are much more 
complex than the matrices over nucleotides discussed above. Many amino 
acid scoring matrices were developed over the years for various purposes, 
including sequence similarity search, structural prediction and phylogenetic 
analysis [SU [TTJ HQ]. Most of them arise from analysis of sets of peptide 
sequences known to be to a certain extent related. 

Dayhoff et al. P] proposed in the 1970s the family of scoring matrices called
PAM, which were based on a Markov model of evolution of proteins. PAM
matrices were the original standard choice for sequence comparison. Several
improved versions of PAM matrices were constructed later [20l |3ll [52], [53l [78],
in order to address some of the deficiencies arising from the lack of sufficient data
at the time of the construction of the original PAM family. For PAM-like
matrices, the larger the number appended to their name (such as PAM-n), the
more the sequences to be compared are assumed to have diverged in evolution.

Presently, the most widely used family of scoring matrices is BLOSUM,
derived by Henikoff and Henikoff in 1992 [28] using an empirical procedure.
In particular, the BLOSUM62 matrix has long been believed to be among
the best performing matrices for general sequence similarity search [29] and
is used as the default by BLAST (more specifically, the blastp program). In
contrast to the PAM-like matrices, the larger the number appended to the
name of a BLOSUM matrix, the more the sequences to be compared are
assumed to be closely related.

In addition to the above mentioned families, some score matrices were 
constructed specifically for searches involving transmembrane regions of pro- 
teins [35| [STl [56] while others were derived from structural alignments in 
order to improve sensitivity of searches involving distantly related proteins 
[631 [361 [5]. 

Table 1 shows the numbers of violations of the triangle inequality for the
distances $\mathrm{AQ}^1$, $\mathrm{AQ}^2$ and $\mathrm{AM}^2$ obtained from several common (symmetric)
score matrices. The matrices featured in Table 1 are all sane and represent
only a very small sample of the existing amino acid score matrices, chosen
among those most frequently used and cited.

All of the scoring matrices mentioned so far were symmetric with the
exception of the SLIM family [^I] for comparison of transmembrane proteins.
Matrix        Reference   AQ^1   AQ^2   AM^2
PAM40                       28      0      0
PAM120                      88      0      0
PAM250                     168     21      0
GONNET                     144      0      0
BLOSUM45      [28]           0      0      0
BLOSUM50                     0      0      0
BLOSUM62                     0      0      0
BLOSUM80                     0      0      0
JTT                        170     34     34
JTTtm                      214     18     20
BC0030                     214     12      4
SDM                        134      0      0
HSDM                       142      6      0
OPTIMA        [36]          74     15      2
PHAT75/73                    6      0      0
VTML160                     28      0      0
VTML250                    100     14      0
dist.20comp                  0      0      0
PMB120                       0      0      0
PMB250        [78]           8      3      0

Table 1: Number of triples of amino acids failing the triangle inequality for
distances derived from various symmetric score matrices. All the matrices
are considered over the standard (20 letter) amino acid alphabet (that is,
excluding non-standard letters representing more than one amino acid). Due
to symmetry of similarity scores, the triangle inequalities for AQ^1 and AM^1
are equivalent and the column for AM^1 is omitted.

Yu et al. [SZ| recently proposed a concept of compositionally adjusted score
matrices, which are asymmetric and which can be derived from symmetric
score matrices by considering different background frequencies of amino acids
in the first vs. the second sequence. The rationale for compositional
adjustment is that some proteins, especially from organisms with biased amino
acid usage, can have background frequencies of amino acids that differ
significantly from the ones used to construct the standard matrices. It was
demonstrated in [87] that using compositional adjustment improves the
sensitivity of pairwise sequence comparison.
              C. tetani                    M. tuberculosis
Matrix        AQ^1   AQ^2   AM^1   AM^2    AQ^1   AQ^2   AM^1   AM^2
PAM40           36      0     36      0      40      0     40      0
PAM120         129      0    126      0     113      0    116      0
GONNET         152      0    152      0     151      0    150      0
BLOSUM45         0      0      0      0       4      0      4      0
BLOSUM50         1      0      2      0       3      0      2      0
BLOSUM62         1      0      2      0       1      0      2      0
BLOSUM80         0      0      0      0       0      0      0      0
JTT            353     11    378      0     320      5    330      0
BC0030         234      3    244      4     249      2    272      4
SDM            132      0    132      0     132      8    132      0
HSDM           144      1    144      0     143      0    142      0
OPTIMA          77      4     78      2      78      2     80      2
PHAT75/73       10      0     12      0      19      0     26      0
VTML160         32      0     34      0      42      0     50      0
dist.20comp      0      0      0      0       0      0      0      0
PMB120           0      0      0      0       0      0      0      0

Table 2: Number of triples of amino acids failing the triangle inequality for
various compositionally adjusted asymmetric score matrices. Each matrix
was adjusted from a symmetric matrix by using the composition of either the C.
tetani or the M. tuberculosis proteome as the first set of frequencies, together
with the implicit amino acid frequencies from BLOSUM62 as the second set
of frequencies.

Table 2 shows the violations of the triangle inequality for the distances
obtained from some of the matrices from Table 1 adjusted to take into account
the amino acid compositions of proteomes of the bacterial species Clostridium
tetani and Mycobacterium tuberculosis. Both of these species have
compositionally biased genomes and proteomes. The matrices were constructed using
a Newtonian procedure described in [86] and [3]. The background
distribution for the second sequence comes from the original BLOSUM62 matrix.
In this way, the constructed similarity scores and distances can be used to
compare sequences known to come from the above organisms to sequences
from general datasets.

Table 1 and Table 2 demonstrate that most scoring matrices, both
symmetric and asymmetric, can be converted to the $\mathrm{AM}^2$ metric while many can
be converted to the $\mathrm{AQ}^2$ quasi-metric as well. In contrast, most matrices fail the
triangle inequalities for $\mathrm{AQ}^1$ and $\mathrm{AM}^1$. Therefore, our generalization of edit
distances and related sequence similarities to the $\ell^p$ form allows us to use a much
wider class of matrices to construct (quasi-) metrics on the set of all protein
sequences. This is in contrast to the $\ell^1$-type results from the previous work
[751 m], which only apply to the BLOSUM family plus a few more similar
matrices.
6.3 Profiles 

Recall (Example 2.2) that given a set $\Sigma$, a profile over $\Sigma$ is a word in the
free monoid $M(\Sigma)^*$, that is, a finite sequence of finite measures over $\Sigma$. In
biological applications, $\Sigma$ is finite and therefore a profile $x$ can be treated as
a sequence of vectors $x_i \in \mathbb{R}^n$, where $n = |\Sigma|$. For each $i$, the vector $y = x_i$
has non-negative entries. In some applications, it is further assumed that $y$
is a probability distribution, that is, that $\sum_j y_j = 1$.

In biological context, profiles represent generalized sequences over the 
basic alphabet S where each position has a probability distribution of letters 
instead of a single letter. They were originally introduced by Gribskov et al. 
[21] in order to improve sensitivity of homology search by considering the 
information contained in multiple alignments of related proteins to query 
sequence databases. To do so, a Position Specific Score Matrix or PSSM, 
which gives a similarity score for each letter in S for each position in the 
query profile, is constructed. The profile- sequence comparison using PSSM 
can then be performed using the dynamic programming algorithms such as 
Needleman-Wunsch or Smith- Waterman. Profiles can also be used directly 
in probabilistic Hidden Markov Models [TT] . Profile-based homology searches 
are widely used and have been shown in general to be more sensitive than 
sequence database searches with normal sequences as queries [2l [TT]. 

Profiles can also be compared to other profiles as members of the free
monoid $M(\Sigma)^*$ using distances or similarities discussed in Sections 3, 4 and
5: all that is necessary is to assign a distance or similarity measure on $M(\Sigma)$
and gap penalties. Many scoring schemes were proposed in due course and
we present only a few examples below. For a more detailed overview we refer
the reader to the papers of Edgar and Sjolander [13] and Marti-Renom et
al. [l9], which study their performance for aligning distantly related protein
sequences.

Let $x, y \in \mathbb{R}^n$ be two measures in $M(\Sigma)$ and let $s$ and $d$ denote a similarity
and a distance function, respectively. The symbol $\|\cdot\|$ denotes the $\ell^2$ norm
on $\mathbb{R}^n$.

Example 6.1. The simplest similarity score between two vectors, used in
the CLUSTALW software for multiple sequence alignment [75] (see also Section
7), is to compute their average over a score matrix $s$ on $\Sigma$:

$$s(x, y) = \sum_i \sum_j x_i y_j \, s(a_i, a_j), \tag{48}$$

where $a_i$ denotes the $i$-th letter of $\Sigma$. In general, $\mathrm{AQ}^p(s)$ and $\mathrm{AM}^p(s)$ are not
a quasi-metric or a metric, respectively.

Example 6.2. A natural candidate for a similarity score between two vectors
$x$ and $y$ is their dot product, used in [65]:

$$s(x, y) = x \cdot y = \sum_j x_j y_j. \tag{49}$$

Clearly, $d = \mathrm{AM}^2(s)$ is the standard Euclidean distance:

$$d(x, y) = \|x - y\| = \sqrt{\sum_j (x_j - y_j)^2}. \tag{50}$$



Example 6.3. A variation of the above is the correlation coefficient or cosine 
of the angle between two vectors used in the LAMA algorithm [62]: 

\[ s(x, y) = \frac{x \cdot y}{\|x\| \, \|y\|}. \tag{51} \]

Here $d = AM^2(s)$ can be easily shown to satisfy the triangle inequality. 
In general, $d$ does not separate points, but if $x$ and $y$ are assumed to be 
probability vectors, then $d$ is indeed a metric. 
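
The conversions in Examples 6.2 and 6.3 can be checked numerically. The sketch below assumes that the distance is obtained from a similarity $s$ as $d(x, y) = (s(x, x) + s(y, y) - 2 s(x, y))^{1/2}$; the helper names `dot`, `cosine` and `similarity_to_distance` are ours and purely illustrative.

```python
import math


def dot(x, y):
    """Dot-product similarity, as in Equation (49)."""
    return sum(a * b for a, b in zip(x, y))


def cosine(x, y):
    """Cosine of the angle between x and y, as in Equation (51)."""
    return dot(x, y) / math.sqrt(dot(x, x) * dot(y, y))


def similarity_to_distance(s, x, y):
    """Assumed conversion: d(x, y) = sqrt(s(x, x) + s(y, y) - 2 s(x, y))."""
    squared = s(x, x) + s(y, y) - 2 * s(x, y)
    return math.sqrt(max(squared, 0.0))  # clamp tiny negative rounding errors


x, y = [3.0, 0.0], [0.0, 4.0]
euclid = similarity_to_distance(dot, x, y)        # equals the Euclidean distance

u, v = [1.0, 2.0], [2.0, 4.0]
collapsed = similarity_to_distance(cosine, u, v)  # parallel vectors are not separated
```

With the dot product the result coincides with $\|x - y\|$, while with the cosine any two parallel vectors end up at distance zero, which illustrates why that distance only separates points once the vectors are restricted (for example, to probability vectors).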

Example 6.4. The Jensen-Shannon divergence between two probability vec- 
tors $x$ and $y$, denoted $D^{JS}$, is given by 

\[ D^{JS}(x, y) = \frac{1}{2} \sum_i \left( x_i \log \frac{2 x_i}{x_i + y_i} + y_i \log \frac{2 y_i}{x_i + y_i} \right). \tag{52} \]



While $D^{JS}$ is not a metric, taking the square root, that is, letting $d(x, y) = 
\sqrt{D^{JS}(x, y)}$, does give a metric [14]. Yu [HH] proposed using this metric to 
compare probability distributions that are components of profiles, while Yona 
and Levitt [83] used the following similarity score: 

\[ s(x, y) = \left(1 - D^{JS}(x, y)\right) \left(1 + D^{JS}\!\left(\tfrac{x + y}{2}, \pi\right)\right), \tag{53} \]

where $\pi$ denotes a background distribution. 
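
As a rough illustration of Example 6.4, the following sketch computes the Jensen-Shannon divergence and its square root for small probability vectors. It assumes the usual normalisation (a factor of 1/2 and base-2 logarithms, so that the divergence lies in [0, 1]); the function names are hypothetical and not taken from any of the cited software.

```python
import math


def js_divergence(x, y):
    """Jensen-Shannon divergence of two probability vectors (base-2 logs assumed)."""
    total = 0.0
    for xi, yi in zip(x, y):
        mi = (xi + yi) / 2  # midpoint distribution
        if xi > 0:
            total += 0.5 * xi * math.log2(xi / mi)
        if yi > 0:
            total += 0.5 * yi * math.log2(yi / mi)
    return total


def js_distance(x, y):
    """Square root of the divergence, the candidate metric of Example 6.4."""
    return math.sqrt(max(js_divergence(x, y), 0.0))


p = [0.5, 0.5, 0.0]
q = [0.1, 0.4, 0.5]
r = [0.0, 0.2, 0.8]
# Spot-check of the triangle inequality on one triple (not a proof).
triangle_ok = js_distance(p, r) <= js_distance(p, q) + js_distance(q, r) + 1e-12
```

A spot-check like `triangle_ok` is of course no substitute for the proof in [14]; it only shows how the quantities fit together.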

The above examples suggest that $\ell^2$-type edit distances, and the global and 
local similarities arising from them, could be appropriate for profile-profile 
comparisons. 

7 Applications and Future Directions 

Our results provide a way to construct a large variety of metrics and quasi- 
metrics on free semigroups. In particular, we are able to extend the con- 
version of similarity score matrices into alphabet (generator) distances, to 
the corresponding conversions of sequence similarities, global and local, to 
sequence distances. Hence, we are able to treat biosequence sets as spaces 
with geometry. The metric and quasi-metric structures provide a much richer 
framework than the topologies induced from them: for biosequences, $\Sigma$ is 
finite and hence all topologies induced from edit distances or local similarity 
(quasi-)metrics are equivalent to the discrete topology. 

In terms of statistical characterization, since we allowed more general 
gap penalties and asymmetric scoring matrices, the established statistics for 
similarities may not be fully transferred to our general distances. For this 
reason, to fully exploit our general formulation, it is important to further 
elaborate on its statistical aspects, which is beyond the scope of the current 
paper. 

Apart from setting a general geometric framework for sequence compar- 
ison, most direct applications to biology involve clustering. For example, 
global clustering of protein sequences has been performed [47], [66], using the 
metric from Example 5.10 and other derivations from similarity score. However, 
these works did not consider quasi-metrics and partial orders that could 
provide a more accurate view of the global protein sequence space. Applica- 
tions to indexing and multiple sequence alignment, which we discuss in more 
detail below, can also be considered as clustering. 
Indexing for database search 

One of the principal motivations for establishing the triangle inequalities for 
similarity scores in the literature [731 CD [33] was to accelerate similarity 
search of large DNA and protein sequence databases. It was recognized 
early on that using the full Needleman-Wunsch and Smith-Waterman 
dynamic programming algorithms to search sequence datasets by sequentially 
scanning all entries is prohibitively expensive computationally, and 
heuristic methods such as FASTA [57] and BLAST [1], [2] were developed. While 
very fast, these methods are not consistent [61], that is, they are not 
guaranteed to retrieve all true neighbors of a given query point. Furthermore, 
both FASTA and BLAST sequentially scan all of the sequences in the dataset 
being searched. The idea behind using the triangle inequalities for acceler- 
ating similarity search is to use the intrinsic 'geometry' of the dataset and 
the space it lies in to construct an indexing scheme [27], [61], a structure that 
allows fully retrieving a similarity query without scanning the whole dataset. 
A large amount of effort was spent on producing efficient indexing structures, 
principally concentrating on datasets that are equipped with a metric or a 
vector space structure: a good overview is by Hjaltason and Samet in [30] . 

Let $X \subset \Sigma^*$ be a finite sequence dataset. A range query of $X$ based on 
local similarity $H$ (depending on the score matrix $s$ and gap penalties $\gamma$ and 
$\delta$), centered at the query point $x \in \Sigma^*$ with threshold $\kappa$, is the set 

\[ Q_H(x, \kappa) = \{ y \in X : H(x, y) \geq \kappa \}. \tag{54} \]

We will now consider some ways to construct indexing structures that 
accelerate retrieval of $Q_H$. 

The first way is to consider biological sequences purely as strings with sim- 
ple similarity measures often related to Levenstein distance and use string- 
based techniques such as hashing [191 El IHl [75] or suffix arrays [321 [31] . Such 
indexing schemes are often not consistent but may show good performance 
on datasets of DNA sequences where the similarity measure is very simple. 
For proteins, one approach was to construct a biologically meaningful met- 
ric on the amino acid alphabet and use the edit distance extension of it for 
sequence comparison and indexing [48] . This has an advantage that existing 
methods for indexing metric spaces can be directly applied but ignores the 
need for local similarities, which cannot be converted into edit distances. 

The other approach, investigated by Spiro and Macura [71] and more 
thoroughly implemented by Itoh et al. [33j, was to use the inequality fH2]) . 
which holds for some amino acid scoring matrices (Tabled]). The idea is to 
cluster proteins according to the local similarity score $H$ or the associated metric 
$LM^p(GM^p(s', \gamma', \delta'), f, f)$ (where the similarity score is assumed symmetric and 
$f(a) = s(a, a)$; see Example 5.10), and then, when searching, to compare 
the query sequence to the centers of clusters first and only scan those clusters 
that overlap the query. 

Note that while the neighborhoods of LQ^(GQ^(s, 7, 5), /, 0) are indeed 
equivalent to queries ^h, this is no longer true for neighborhoods of its 
metric symmetrization LM^(GM^(s', 7', 5'), /, /). Hence, direct indexing with 
respect to the local similarity metric may not be optimal. Furthermore, 
not all similarity score matrices give rise to quasi-metrics AQ^(s) (Table 
H]). Many more can be converted to AM^(s) and hence give rise to metrics 
$LM^p(GM^p(s', \gamma', \delta'), f, f)$. Profile-profile comparison methods, relying on the 
inner product for the distance between two distributions, also naturally induce 
$\ell^2$-type distances. None of the methods described above can efficiently cope 
with this situation and yet there exists a simple way to convert such similarity 
queries to a sequence of metric queries. 

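Suppose $M(x, y) = (H(x, x) + H(y, y) - 2H(x, y))^{1/2}$ is a metric for some 
symmetric local similarity $H$. Let 

\[ Z_\zeta = \{ x \in \Sigma^* : H(x, x) = \zeta \}. \tag{55} \]

We call each set $Z_\zeta$ a fiber and it is obvious that $\Sigma^*$ is a disjoint union of all 
$Z_\zeta$, where $\zeta$ runs over the range of self-similarities. For our applications, this 
range is finite because the sequence datasets are finite. Now consider a query 
$Q_H(x, \kappa)$ and let $\varepsilon(x, \zeta, \kappa) = (H(x, x) + \zeta - 2\kappa)^{1/2}$. For $y \in Z_\zeta$ we have 
$M(x, y)^2 = H(x, x) + \zeta - 2H(x, y)$, so $M(x, y) \leq \varepsilon(x, \zeta, \kappa)$ holds exactly when 
$H(x, y) \geq \kappa$. It is therefore easily established that 

\[ Q_H(x, \kappa) = \bigcup_\zeta B(x, \varepsilon(x, \zeta, \kappa)) \cap Z_\zeta, \tag{56} \]

where $B(x, \varepsilon(x, \zeta, \kappa)) = \{ y \in X : M(x, y) \leq \varepsilon(x, \zeta, \kappa) \}$ (the closed ball of 
radius $\varepsilon(x, \zeta, \kappa)$ about $x$). 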

Hence, to process each local similarity range query, it is sufficient to 
process a metric range query $B(x, \varepsilon(x, \zeta, \kappa))$ on each fiber and then collect 
the results. For practical purposes the fibers need to be reasonably large 
and small in number, but that is often true because the score matrices are 
integer-valued. Adjacent fibers that contain too few points can be merged 
if care is exercised when collecting final results. Each fiber can be indexed 
separately as a metric space with one of the many existing access methods 
[20] or by using a new technique. The decomposition (56) was proposed in 
form in [71] for indexing similarity-based range queries and was in turn 
inspired by decomposition of weightable quasi-metric spaces into fibers used 
by Vitolo [79]. 
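
The fiber decomposition can be illustrated with a small brute-force sketch. It assumes a toy Smith-Waterman similarity with a constant match score, a constant mismatch score and a linear gap penalty (all values hypothetical), and it scans each fiber linearly where a real implementation would use a metric index. Squared quantities are compared so that all arithmetic stays in integers.

```python
from collections import defaultdict


def local_similarity(x, y, match=2, mismatch=-1, gap=2):
    """Toy Smith-Waterman local similarity H(x, y) with a linear gap penalty."""
    best = 0
    prev = [0] * (len(y) + 1)
    for a in x:
        cur = [0]
        for j, b in enumerate(y, start=1):
            score = max(
                0,
                prev[j - 1] + (match if a == b else mismatch),
                prev[j] - gap,
                cur[j - 1] - gap,
            )
            cur.append(score)
            best = max(best, score)
        prev = cur
    return best


def direct_query(dataset, x, kappa):
    """Similarity range query: all y in the dataset with H(x, y) >= kappa."""
    return {y for y in dataset if local_similarity(x, y) >= kappa}


def fiber_query(dataset, x, kappa):
    """Same query answered as one metric ball query per fiber."""
    fibers = defaultdict(list)
    for y in dataset:
        fibers[local_similarity(y, y)].append(y)  # fiber label: self-similarity
    hxx = local_similarity(x, x)
    result = set()
    for zeta, members in fibers.items():
        eps_sq = hxx + zeta - 2 * kappa  # squared radius of the ball for this fiber
        for y in members:  # a metric index would replace this linear scan
            m_sq = hxx + zeta - 2 * local_similarity(x, y)  # squared M(x, y)
            if m_sq <= eps_sq:
                result.add(y)
    return result


dataset = ["ACGT", "ACGA", "TTTT", "ACG", "GT", "CGTA"]
query = "ACGT"
```

For every threshold, `fiber_query(dataset, query, kappa)` returns the same set as `direct_query(dataset, query, kappa)`; any speed-up would come entirely from replacing the inner scan with a metric access method on each fiber.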

Therefore, using fibers, a consistent indexing scheme can be constructed 
for most existing local similarity measures on biological sequences and 
profiles. The performance of such schemes is not guaranteed: it depends on the 
exact geometry of sequence datasets [60], [61]. Hence, our theoretical results 
represent only the first step towards efficient and consistent access methods 
that are to be achieved in the future. 

An alternative to fiber decomposition for cases where $LQ^p(GQ^p(s, \gamma, \delta), f, 0)$ 
is truly a quasi-metric is to use the quasi-metric directly to index the dataset. 
Pestov and Stojmirovic [61] proposed the concept of a quasi-metric tree: a 
general indexing scheme for retrieving queries based on quasi-metrics and 
established conditions for its consistency. Note that in the £^ case, using 
inequality (H2|) directly, as in [33], produces a structure that is equivalent to 
a quasi-metric tree. 

Progressive multiple sequence alignment 

Multiple sequence alignment (MSA) is among the most valuable tools in 
computational biology. It allows extracting and representing biologically 
important commonalities from sets of sequences [25]. Construction of multiple 
alignments from sets of sequences has been extensively researched and a va- 
riety of techniques have been proposed [251 IIH]- The full dynamic program- 
ming algorithm for MSA is NP-complete [HO] and therefore heuristics are 
commonly employed. One popular heuristic approach is progressive align- 
ment [15]. First, a guide tree is constructed from pairwise dissimilarities 
between sequences. Then, larger and larger groups of sequences are aligned 
in pairwise manner, following the branching order of the guide tree from the 
leaves towards the root. A number of popular software packages for MSA of 
protein sequences [761 EH [12], |15] implement this heuristic. 
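
The guide-tree step described above can be sketched as follows. This is a minimal average-linkage (UPGMA-style) construction, which is only one common choice and not necessarily the one used by the cited packages; it assumes a symmetric distance matrix, and the function name `guide_tree` is ours.

```python
def guide_tree(names, dist):
    """Average-linkage guide tree from a symmetric distance matrix.

    Returns the tree as nested tuples of sequence names.
    """
    n = len(names)
    clusters = {i: (names[i], 1) for i in range(n)}  # id -> (subtree, size)
    d = {(i, j): dist[i][j] for i in range(n) for j in range(i + 1, n)}
    next_id = n
    while len(clusters) > 1:
        a, b = min(d, key=d.get)  # closest pair of current clusters
        (tree_a, size_a), (tree_b, size_b) = clusters.pop(a), clusters.pop(b)
        merged = {}
        for c in clusters:
            d_ac = d[min(a, c), max(a, c)]
            d_bc = d[min(b, c), max(b, c)]
            # size-weighted average distance to the merged cluster
            merged[c, next_id] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        d = {k: v for k, v in d.items() if a not in k and b not in k}
        d.update(merged)
        clusters[next_id] = ((tree_a, tree_b), size_a + size_b)
        next_id += 1
    return next(iter(clusters.values()))[0]


names = ["a", "b", "c", "d"]
dist = [
    [0, 1, 6, 7],
    [1, 0, 6, 7],
    [6, 6, 0, 2],
    [7, 7, 2, 0],
]
tree = guide_tree(names, dist)
```

A progressive aligner would then align groups in the order given by the nesting of `tree`, from the leaves towards the root.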

The success of this approach, greedy in nature, crucially depends on a 
faithful and evolutionarily meaningful construction of a guide tree for the 
set of sequences to be aligned. When constructing their guide trees, most 
methods do not use a true metric to compute pairwise distances [TSJ [321 
[12], while those that do [55], use the Levenstein distance, overlooking the 
similarities between closely related amino acids. 
There are advantages in using a true metric distance for agglomerative 
hierarchical clustering. For example, the triangle inequality ensures the 
transitivity of closeness in the distance measure. Furthermore, when this is the case, 
it was shown that the difference between a hierarchical clustering and the 
optimal $k$-clustering is bounded [8]. 

In this paper we have demonstrated a way to construct a large class of 
(quasi-) metric distances from similarity scores that also naturally account 
for functional relatedness among amino acids. The quasi-metrics developed 
in Section 5 can also provide a rigorous way to naturally interpolate from 
global to local similarities in constructing guide trees. 

Embeddings into vector spaces 

Let $Q^s$ be the metric symmetrizing the quasi-metric $Q = LQ^p(\rho, f, 0)$, where 
$Q^s(x, y) = Q(x, y) + Q(y, x)$ for all $x, y \in \Sigma^*$. Observe that by the triangle 
inequality for $Q$, 

\[ (f(x))^{1/p} - (f(y))^{1/p} = Q(x, e) - Q(y, e) \leq Q(x, y), \tag{57} \]

and hence, letting $\alpha(x) = (f(x))^{1/p}$, we have 

\[ |\alpha(x) - \alpha(y)| \leq Q^s(x, y) \leq \alpha(x) + \alpha(y). \tag{58} \]

Flood, in his PhD thesis [17] and a follow-up paper [18], called any pair 
$(\rho, \alpha)$, where $\rho$ is a metric and $\alpha$ a positive function, which satisfies the above 
property a normed pair. The triple $(X, \rho, \alpha)$, where $(\rho, \alpha)$ is a normed pair 
on $X$, is called a normed set [58]. Every normed space $(E, \|\cdot\|_E)$ naturally 
becomes a normed set by setting $\rho(x, y) = \|x - y\|_E$ and $\alpha(x) = \|x\|_E$. 

For any two normed sets $X_1 = (X_1, \rho_1, \alpha_1)$ and $X_2 = (X_2, \rho_2, \alpha_2)$, a 
function $\pi : X_1 \to X_2$ is called a contraction if for all $x \in X_1$, 

\[ \alpha_2(\pi(x)) \leq \alpha_1(x) \tag{59} \]

and for all $x, y \in X_1$, 

\[ \rho_2(\pi(x), \pi(y)) \leq \rho_1(x, y). \tag{60} \]

According to a result of Flood [17], [18] (see also [58]), the normed pair 
structure supports a natural embedding of $X$ into a Banach space with a certain 
universal property. 

Theorem 7.1 ([17], [18], [58]). Let $X = (X, \rho, \alpha)$ be a normed set. There 
exists a complete normed space $B(X)$ and an embedding of $X$ into $B(X)$ 
as a normed subset such that every contraction $\pi$ from $X$ to a complete 
normed space $E$ lifts to a unique linear contraction $\bar{\pi} : B(X) \to E$. The 
pair consisting of $B(X)$ and the embedding $X \hookrightarrow B(X)$ is essentially unique. 
Elements of $X$ are linearly independent. □ 

Therefore, spaces of biological sequences with local similarity metric may 
be founded upon Banach (or even Hilbert) spaces. However, this result 
carries only theoretical significance at this point and cannot be directly used 
for clustering or indexing since the free Banach space B{X) is too large (it 
is not desirable that all sequences are linearly independent). Nevertheless, 
the same idea can be used to embed similarity score matrices into finite 
dimensional normed spaces and hence consider biological sequences as free 
semigroups over $\mathbb{R}^n$. 

Acknowledgments 

A.S. is very grateful to Vladimir Pestov who as his Ph.D. and postdoctoral 
supervisor read and commented on the early versions of this manuscript. 
A.S. was supported by the University of Ottawa research funds. This work 
was supported by the Intramural Research Program of the National Library 
of Medicine at National Institutes of Health. 






A Proofs 



A.1 General conditions for edit quasi-metrics 

Theorem A.1. Let $\Sigma$ be a set, let $1 \leq p < \infty$ and suppose $d$ is a separating 
quasi-metric on $\Sigma$, $\alpha, \beta \in \Gamma(\Sigma)$ and $D$ is the $\ell^p$ edit distance extending $d$, $\alpha$ and 
$\beta$. In addition, assume that for all $a, b \in \Sigma$, $u, v, x \in \Sigma^*$, 

(W1) $d^p(a, b) + \beta^p(ubv) \geq \beta^p(uav)$; 

(W2) $d^p(a, b) + \alpha^p(uav) \geq \alpha^p(ubv)$; 

(W3) $\beta^p(uv) + \beta^p(x) \geq \beta^p(uxv)$; 

(W4) $\alpha^p(uv) + \alpha^p(x) \geq \alpha^p(uxv)$; 

(W5) $\beta^p(uxv) + \alpha^p(x) \geq \beta^p(uv)$; 

(W6) $\alpha^p(uxv) + \beta^p(x) \geq \alpha^p(uv)$; 

(W7) $\alpha^p(ux) + \beta^p(xv) \geq \alpha^p(u) + \beta^p(v)$; 

(W8) $\beta^p(ux) + \alpha^p(xv) \geq \beta^p(u) + \alpha^p(v)$. 

Then, $D$ is a separating quasi-metric on $\Sigma^*$. 

Proof. Let $x, y, z \in \Sigma^*$. Clearly, $D(x, y)$ is non-negative since all of $d$, $\alpha$ and 
$\beta$ are non-negative. Also, $D(x, x) \leq \left(\sum_i d^p(x_i, x_i)\right)^{1/p} = 0$. Now suppose 
$D(x, y) = 0$. Applying Lemma 3.7 we have 

\[ D(x, y) = \left( \sum_{k=1}^{K} D^p(x^*_k, y^*_k) \right)^{1/p} = 0, \]

where $x = x^*_1 x^*_2 \ldots x^*_K$, $y = y^*_1 y^*_2 \ldots y^*_K$, implying $D(x^*_k, y^*_k) = 0$ for all $k$ since 
$D$ is non-negative. Hence, $x^*_k = y^*_k$ for all possible cases of $x^*_k$ and $y^*_k$ because 
$d$ is a separating quasi-metric and $\alpha$ and $\beta$ are strictly positive on $\Sigma^+$. 

We will demonstrate the triangle inequality by relying on the Minkowski 
inequality: for any two sequences $a$ and $b$ of real numbers and $1 \leq p < \infty$, 

\[ \left( \sum_i |a_i + b_i|^p \right)^{1/p} \leq \left( \sum_i |a_i|^p \right)^{1/p} + \left( \sum_i |b_i|^p \right)^{1/p}. \tag{61} \]
We show by induction that for all $0 \leq i \leq |x|$, $0 \leq j \leq |y|$ and $0 \leq k \leq |z|$, 

\[ D(\bar{x}_i, \bar{y}_j) + D(\bar{y}_j, \bar{z}_k) \geq D(\bar{x}_i, \bar{z}_k). \tag{62} \]

Let $\preceq$ denote a partial order on $\mathbb{N} \times \mathbb{N} \times \mathbb{N}$ where $(i_0, j_0, k_0) \preceq (i, j, k)$ if $i_0 < i$, 
or $i_0 = i$ and $j_0 < j$, or $i_0 = i$ and $j_0 = j$ and $k_0 \leq k$ (lexicographic order). 
The relation $\preceq$ is a well-founded partial order of type $\omega^3$ (in this case our 
induction is finite) and our claim is trivially true for $(0, 0, 0)$. Assume it is true 
for all $(i', j', k') \prec (i, j, k)$. There are nine possibilities in total to consider 
for $(i', j', k') = (i, j, k)$. 

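Case 1: Suppose $D(\bar{x}_i, \bar{y}_j) = (D^p(\bar{x}_{i-1}, \bar{y}_{j-1}) + d^p(x_i, y_j))^{1/p}$ and $D(\bar{y}_j, \bar{z}_k) = 
(D^p(\bar{y}_{j-1}, \bar{z}_{k-1}) + d^p(y_j, z_k))^{1/p}$. By the Minkowski inequality, our induction 
hypothesis and the triangle inequality on $d$ we have 

\begin{align*}
D(\bar{x}_i, \bar{y}_j) + D(\bar{y}_j, \bar{z}_k)
  &= (D^p(\bar{x}_{i-1}, \bar{y}_{j-1}) + d^p(x_i, y_j))^{1/p} + (D^p(\bar{y}_{j-1}, \bar{z}_{k-1}) + d^p(y_j, z_k))^{1/p} \\
  &\geq \left( (D(\bar{x}_{i-1}, \bar{y}_{j-1}) + D(\bar{y}_{j-1}, \bar{z}_{k-1}))^p + (d(x_i, y_j) + d(y_j, z_k))^p \right)^{1/p} \\
  &\geq (D^p(\bar{x}_{i-1}, \bar{z}_{k-1}) + d^p(x_i, z_k))^{1/p} \\
  &\geq D(\bar{x}_i, \bar{z}_k).
\end{align*}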

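Case 2: Suppose $D(\bar{y}_j, \bar{z}_k) = (D^p(\bar{y}_j, \bar{z}_{k-t}) + \alpha^p(z_{k-t+1} \ldots z_k))^{1/p}$ for some 
$1 \leq t \leq k$ (this covers three possibilities). By the Minkowski inequality and 
the induction hypothesis we have 

\begin{align*}
D(\bar{x}_i, \bar{y}_j) + D(\bar{y}_j, \bar{z}_k)
  &= D(\bar{x}_i, \bar{y}_j) + (D^p(\bar{y}_j, \bar{z}_{k-t}) + \alpha^p(z_{k-t+1} \ldots z_k))^{1/p} \\
  &\geq \left( (D(\bar{x}_i, \bar{y}_j) + D(\bar{y}_j, \bar{z}_{k-t}))^p + \alpha^p(z_{k-t+1} \ldots z_k) \right)^{1/p} \\
  &\geq (D^p(\bar{x}_i, \bar{z}_{k-t}) + \alpha^p(z_{k-t+1} \ldots z_k))^{1/p} \\
  &\geq D(\bar{x}_i, \bar{z}_k).
\end{align*}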

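Case 3: Suppose $D(\bar{x}_i, \bar{y}_j) = (D^p(\bar{x}_{i-t}, \bar{y}_j) + \beta^p(x_{i-t+1} \ldots x_i))^{1/p}$ for some 
$1 \leq t \leq i$ (this covers additional two possibilities). Then, in similar manner 
as in Case 2, 

\[ D(\bar{x}_i, \bar{y}_j) + D(\bar{y}_j, \bar{z}_k) \geq (D^p(\bar{x}_{i-t}, \bar{z}_k) + \beta^p(x_{i-t+1} \ldots x_i))^{1/p} \geq D(\bar{x}_i, \bar{z}_k), \]

by the Minkowski inequality and the induction hypothesis. 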
Case 4: Suppose $D(\bar{y}_j, \bar{z}_k) = (D^p(\bar{y}_{j-t}, \bar{z}_k) + \beta^p(y_{j-t+1} \ldots y_j))^{1/p}$, for 
some $1 \leq t \leq j$ (this covers additional two possibilities). Using Lemma 3.7, 
let $0 \leq q \leq j$ be the smallest integer not larger than $t$ such that 

/ K \ Vp 

y,) = D^ixr, + Yl v*J , 

V m=l J 

for some I < r < i, where u = Xr+i . . . Xi = . . . u}^, and v = yj^q+i ■ ■ - Vj = 
Note that g < t if and only if D(xr,yj-q) = [DP{xr,yj-q') + 

D'^{e,yj_q/+i . . .yj_q)Y^^, where q < t < q' < j. In that case, by our as- 
sumption (W7) and by Minkowski inequality, 

DP{xr, yj-q) + I3^iyj-t+i ■ ■ ■ Vj) = DP{xr, yj-q') + aP{yj-q>+i . . . yj-q) 

+ P^iy,-t+i...yj) 

> DP{xr, yj-q') + . . . yj-t) 

>D^ixr,y,-t) + P''iv). 

Of course, the same inequality trivially holds if t = g. 

Observe that assumptions (Wl), (W3) and (W5) imply that for any 1 < 
m < K and any Wi,W2 € S*, 

DPiu*„,,v*J+/3^iw^v:,W2) > [^"{w^ul^w^), (63) 

and hence 

K K 

D'i<, v*j + ny,-,+i •••%•)> E o + ^'("^^2 • • • o 

m=l m=2 

K 

m,=3 
Therefore, 



K \ Vp 



m=l 



1/p 



1/p 



> 



1/p 



K 
m=l 

^ 1/p 

> D{xi,Zk), 

by the induction hypothesis. 

Case 5: The remaining case is $D(\bar{x}_i, \bar{y}_j) = (D^p(\bar{x}_i, \bar{y}_{j-t}) + \alpha^p(y_{j-t+1} \ldots y_j))^{1/p}$ 
for some $1 \leq t \leq j$ and $D(\bar{y}_j, \bar{z}_k) = (D^p(\bar{y}_{j-1}, \bar{z}_{k-1}) + d^p(y_j, z_k))^{1/p}$. The proof 
for this case exactly mirrors the proof for the previous case, now depending 
on the assumptions (W2), (W4), (W6) and (W8). □ 

Remark A.2. In general the assumptions (W1)-(W8) are sufficient for 
$D$ to be a quasi-metric but not necessary, except in the case of $p = 1$. For 
example, let $\Sigma = \{a, b\}$, $d(a, b) = d(b, a) = 3$, $\alpha = \beta$, $\alpha(a) = 7$, $\alpha(b) = 4$, 
$\alpha(u) = \sum_i \alpha(u_i)$. In this case the assumptions (W1) and (W2) fail but it 
can be verified that the triangle inequality for $D$ does not fail for any $p > 1$. 
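
The failure of (W1) in this example is easy to check by hand or by machine. The sketch below only tests the single-letter instance of (W1), that is, $u = v = e$ (our simplification), and reads the condition as $d^p(a, b) + \beta^p(b) \geq \beta^p(a)$; the function name `w1_holds` is hypothetical.

```python
def w1_holds(p, d, beta, letters):
    """Single-letter instance of (W1): d(a,b)^p + beta(b)^p >= beta(a)^p."""
    return all(
        d[a, b] ** p + beta[b] ** p >= beta[a] ** p
        for a in letters
        for b in letters
        if a != b
    )


letters = "ab"
d = {("a", "b"): 3, ("b", "a"): 3}  # alphabet distance from the remark
beta = {"a": 7, "b": 4}             # gap penalties, with alpha = beta

results = {p: w1_holds(p, d, beta, letters) for p in (1, 2, 3)}
```

For $p = 2$ the check compares $3^2 + 4^2 = 25$ against $7^2 = 49$, so the condition fails, whereas for $p = 1$ it holds with equality.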

Remark A.3. The assumptions (W1)-(W8) can be significantly simplified 
if the gap penalties take a more restricted form. For example, if the gap 
penalties are increasing, the assumptions (W5)-(W8) can be removed. This 
restriction is sensible in applications to biological sequence comparisons 
because algebraic interactions lowering the effective length of the sequence are 
not allowed. On the other hand, if $\Sigma^*$ is replaced as the underlying set with 
a monoid which is not free, or even a group, then gap penalties cannot be 
increasing in the above sense. 

Since composition-length gap penalties are increasing by definition, 
Theorem 3.10 is a direct corollary of Theorem A.1. Furthermore, 
composition-length gap penalties with = 0, such as linear or affine, satisfy all of 
(W1)-(W8). 

A.2 Global similarities 

Proposition 4.5. Let $\Sigma$ be a set and let $s : \Sigma \times \Sigma \to \mathbb{R}$ be a sane scoring 
function over $\Sigma$. Suppose $\gamma, \delta \in \Gamma(\Sigma)$ and $S$ is the global similarity on $\Sigma^*$ with 
respect to $s$, $\delta$ and $\gamma$. Then, $S$ is a sane scoring function and for all $x \in \Sigma^*$, 

\[ S(x, x) = \sum_{i=1}^{|x|} s(x_i, x_i). \tag{64} \]

□ 

We will make use of the following lemma, equivalent to Lemma 3.7 for 
distances. It was likewise proved by Smith and Waterman for the case 
and less general gap penalties. 

Lemma A.4. Let $\Sigma$ be a set, $s : \Sigma \times \Sigma \to \mathbb{R}$, and $\gamma, \delta : \Sigma^+ \to \mathbb{R}_+$. Suppose $S$ 
is a global similarity on $\Sigma^*$ with respect to $s$, $\gamma$ and $\delta$. Then, for all $x, y \in \Sigma^*$ 

\[ S(x, y) = \max \left\{ \sum_{k=1}^{K} S(x^*_k, y^*_k) \;\middle|\; ((x^*_k, y^*_k))_{k=1}^{K} \in A(x, y) \right\}. \tag{65} \]

Proof of Proposition 4.5. Let $x, y \in \Sigma^*$. If $x = e$, by definition $S(x, x) = 0$, 
coinciding with a sum over the empty set. Since $\gamma$ and $\delta$ are positive, we 
have $-\gamma(y) = S(e, y) \leq 0$ and $-\delta(y) = S(y, e) \leq 0$. 

Now suppose $x \in \Sigma^+$ and let $((x^*_k, y^*_k))_{k=1}^{K} \in A(x, y)$ be such that $S(x, y) = 
\sum_{k=1}^{K} S(x^*_k, y^*_k)$. Let $C = \{k : x^*_k \in \Sigma \text{ and } y^*_k \in \Sigma\}$ and $D = \{k : x^*_k \in 
\Sigma^+ \text{ and } y^*_k = e\}$. Then, 

\begin{align*}
S(x, y) &\leq \sum_{k \in C} s(x^*_k, y^*_k) + \sum_{k \in D} S(x^*_k, y^*_k) \\
        &\leq \sum_{k \in C} s(x^*_k, x^*_k) + \sum_{k \in D} \sum_j s((x^*_k)_j, (x^*_k)_j) \\
        &= \sum_{i=1}^{|x|} s(x_i, x_i),
\end{align*}

since $s$ is sane and the whole of $x$ is accounted for in fragments indexed by 
$C$ and $D$. Therefore, 

\[ S(x, y) \leq \sum_{i=1}^{|x|} s(x_i, x_i) \leq S(x, x), \tag{66} \]

implying $S(x, x) = \sum_{i=1}^{|x|} s(x_i, x_i) \geq S(x, y)$. In the same 
way it can be shown that $S(x, x) \geq S(y, x)$ and hence that $S$ is sane. □ 

Corollary 4.7. Let $\Sigma$ be a set and let $1 \leq p < \infty$. Suppose $s$ is a sane 
scoring function on $\Sigma$, $d = AQ^p(s)$ is a quasi-metric on $\Sigma$ and $\gamma, \delta \in \Gamma_{CL}(\Sigma)$ 
are such that 

\[ \gamma(b) - \gamma(a) \leq d^p(a, b) \tag{67} \]

and 

\[ s(a, a) + \delta(a) - s(b, b) - \delta(b) \leq d^p(a, b). \tag{68} \]

Let $S$ be the global similarity with respect to $s$, $\gamma$ and $\delta$ and let $\alpha(x) = \gamma(x)^{1/p}$ 
and $\beta(x) = (S(x, x) + \delta(x))^{1/p}$ for all $x \in \Sigma^+$. Then, the edit distance 
$D = EQ^p(d, \alpha, \beta)$ is given for all $x, y \in \Sigma^*$ by the formula 

\[ D(x, y) = (S(x, x) - S(x, y))^{1/p}. \tag{69} \]

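Proof. By construction, $\alpha^p \in \Gamma_{CL}(\Sigma)$ and by Proposition 4.5 $\beta^p \in \Gamma_{CL}(\Sigma)$ 
as well. By our assumptions on $d$, $\gamma$ and $\delta$ and by Theorem 3.10 it follows 
that $D$, the edit distance extending $d$, $\alpha$ and $\beta$, is indeed the separating 
quasi-metric $EQ^p(d, \alpha, \beta)$ on $\Sigma^*$. We will now show by recursion that this 
quasi-metric is equivalent to the one given by Equation (69). 

Clearly, $D(e, e) = (S(e, e) - S(e, e))^{1/p} = 0$. Let $x, y \in \Sigma^+$ and suppose 
$1 \leq i \leq |x|$ and $1 \leq j \leq |y|$. We have 

\[ D(e, \bar{y}_j) = \alpha(\bar{y}_j) = \gamma(\bar{y}_j)^{1/p} = (S(e, e) - S(e, \bar{y}_j))^{1/p} \]

and 

\[ D(\bar{x}_i, e) = \beta(\bar{x}_i) = (\delta(\bar{x}_i) + S(\bar{x}_i, \bar{x}_i))^{1/p} = (S(\bar{x}_i, \bar{x}_i) - S(\bar{x}_i, e))^{1/p}. \]

Using recursion and Proposition 4.5, 

\begin{align*}
D(\bar{x}_i, \bar{y}_j)
  &= \Big( \min \Big\{ D^p(\bar{x}_{i-1}, \bar{y}_{j-1}) + d^p(x_i, y_j), \\
  &\qquad \min_{1 \leq k \leq j} \{ D^p(\bar{x}_i, \bar{y}_{j-k}) + \alpha^p(y_{j-k+1} \ldots y_j) \}, \\
  &\qquad \min_{1 \leq k \leq i} \{ D^p(\bar{x}_{i-k}, \bar{y}_j) + \beta^p(x_{i-k+1} \ldots x_i) \} \Big\} \Big)^{1/p} \\
  &= \Big( \min \Big\{ S(\bar{x}_{i-1}, \bar{x}_{i-1}) - S(\bar{x}_{i-1}, \bar{y}_{j-1}) + s(x_i, x_i) - s(x_i, y_j), \\
  &\qquad \min_{1 \leq k \leq j} \{ S(\bar{x}_i, \bar{x}_i) - S(\bar{x}_i, \bar{y}_{j-k}) + \gamma(y_{j-k+1} \ldots y_j) \}, \\
  &\qquad \min_{1 \leq k \leq i} \{ S(\bar{x}_{i-k}, \bar{x}_{i-k}) - S(\bar{x}_{i-k}, \bar{y}_j) + \delta(x_{i-k+1} \ldots x_i) \\
  &\qquad\qquad + S(x_{i-k+1} \ldots x_i, x_{i-k+1} \ldots x_i) \} \Big\} \Big)^{1/p} \\
  &= \Big( S(\bar{x}_i, \bar{x}_i) - \max \Big\{ S(\bar{x}_{i-1}, \bar{y}_{j-1}) + s(x_i, y_j), \\
  &\qquad \max_{1 \leq k \leq j} \{ S(\bar{x}_i, \bar{y}_{j-k}) - \gamma(y_{j-k+1} \ldots y_j) \}, \\
  &\qquad \max_{1 \leq k \leq i} \{ S(\bar{x}_{i-k}, \bar{y}_j) - \delta(x_{i-k+1} \ldots x_i) \} \Big\} \Big)^{1/p} \\
  &= (S(\bar{x}_i, \bar{x}_i) - S(\bar{x}_i, \bar{y}_j))^{1/p},
\end{align*}

as required. □ 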
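
The identity (69) can be spot-checked numerically in the simplest setting. The sketch below assumes $p = 1$, a two-letter alphabet with a hypothetical asymmetric score matrix, and the same linear gap penalty for $\gamma$ and $\delta$; with linear gaps, single-letter gap steps suffice in both recursions. It checks the formula only and does not test conditions (67) and (68).

```python
from itertools import product

ALPHABET = "AB"
# Hypothetical asymmetric score matrix with s(a, a) >= s(a, b) and s(a, a) >= s(b, a).
SCORE = {("A", "A"): 4, ("A", "B"): 1, ("B", "A"): -1, ("B", "B"): 3}
GAP = 2  # linear gap penalty per letter, used for both gamma and delta


def global_similarity(x, y):
    """Global similarity S(x, y) by dynamic programming."""
    S = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(1, len(x) + 1):
        S[i][0] = S[i - 1][0] - GAP  # S(x_1..x_i, e) = -delta
    for j in range(1, len(y) + 1):
        S[0][j] = S[0][j - 1] - GAP  # S(e, y_1..y_j) = -gamma
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            S[i][j] = max(
                S[i - 1][j - 1] + SCORE[x[i - 1], y[j - 1]],
                S[i][j - 1] - GAP,
                S[i - 1][j] - GAP,
            )
    return S[-1][-1]


def edit_distance(x, y):
    """Edit distance for p = 1 with d(a, b) = s(a, a) - s(a, b),
    alpha = gamma and beta(a) = s(a, a) + delta(a)."""
    D = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(1, len(x) + 1):
        D[i][0] = D[i - 1][0] + SCORE[x[i - 1], x[i - 1]] + GAP
    for j in range(1, len(y) + 1):
        D[0][j] = D[0][j - 1] + GAP
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            a, b = x[i - 1], y[j - 1]
            D[i][j] = min(
                D[i - 1][j - 1] + SCORE[a, a] - SCORE[a, b],
                D[i][j - 1] + GAP,
                D[i - 1][j] + SCORE[a, a] + GAP,
            )
    return D[-1][-1]


def check_formula(max_len):
    """Check D(x, y) = S(x, x) - S(x, y) on all words up to max_len letters."""
    words = ["".join(w) for n in range(max_len + 1) for w in product(ALPHABET, repeat=n)]
    return all(
        edit_distance(x, y) == global_similarity(x, x) - global_similarity(x, y)
        for x in words
        for y in words
    )
```

Because the score matrix is asymmetric, so is the resulting distance: `edit_distance("A", "B")` and `edit_distance("B", "A")` differ.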






References 



[1] S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman. 
Basic local alignment search tool. J. Mol. Biol., 215(3):403-410, Oct 
1990. 

[2] S. F. Altschul, T. L. Madden, A. A. Schäffer, J. Zhang, Z. Zhang, 
W. Miller, and D. J. Lipman. Gapped BLAST and PSI-BLAST: a new 
generation of protein database search programs. Nucleic Acids Res., 
25:3389-3402, 1997. 

[3] S. F. Altschul, J. C. Wootton, E. M. Gertz, R. Agarwala, A. Morgulis, 
A. A. Schäffer, and Y.-K. Yu. Protein database searches using compo- 
sitionally adjusted substitution matrices. FEBS J., 272(20):5101-5109, 
2005. 

[4] R. Bellman, J. Holland, and R. Kalaba. On an application of dynamic 
programming to the synthesis of logical systems. J. ACM, 6(4):486-493, 
1959. 

[5] J. D. Blake and F. E. Cohen. Pairwise sequence alignment below the 
twilight zone. J. Mol. Biol., 307(2):721-735, Mar 2001. 

[6] J. Buhler. Efficient large-scale sequence comparison by locality-sensitive 
hashing. Bioinformatics, 17:419-428, 2001. 

[7] G. E. Crooks and S. E. Brenner. An alternative model of amino acid 
replacement. Bioinformatics, 21(7):975-980, 2005. 

[8] S. Dasgupta and P. M. Long. Performance guarantees for hierarchical 
clustering. J. Comput. Syst. Sci., 70(4):555-569, 2005. 

[9] M. O. Dayhoff, R. M. Schwartz, and B. C. Orcutt. A model of evolu- 
tionary change in proteins. In M. O. Dayhoff, editor, Atlas of Protein 
Sequence and Structure, volume 5, chapter 22, pages 345-352. National 
Biomedical Research Foundation, 1978. 

[10] R. Durbin, S. Eddy, A. Krogh, and G. Mitchison. Biological sequence 
analysis. Cambridge University Press, Cambridge, UK, 1998. 

[11] S. Eddy. Profile hidden Markov models. Bioinformatics, 14:755-763, 
1998. 






[12] R. C. Edgar. MUSCLE: multiple sequence alignment with high accuracy 
and high throughput. Nucleic Acids Res., 32(5):1792-1797, 2004. 

[13] R. C. Edgar and K. Sjolander. A comparison of scoring functions for 
protein sequence profile alignment. Bioinformatics, 20(8):1301-1308, 
2004. 

[14] D. M. Endres and J. E. Schindelin. A new metric for probability distri- 
butions. IEEE T. Inform. Theory, 49(7):1858-1860, 2003. 

[15] D. F. Feng and R. F. Doolittle. Progressive sequence alignment as a 
prerequisite to correct phylogenetic trees. J. Mol. Evol., 25(4):351-360, 
1987. 

[16] I. Fischer. Similarity-preserving metrics for amino-acid sequences. 
Poster at the 22nd GIF Meeting on Challenges in Genomic Research: 
Neurodegenerative Diseases, Stem Cells, Bioethics, Heidelberg 2002. 

[17] J. Flood. Free Topological Vector Spaces. PhD thesis, Australian Na- 
tional University, Canberra, 1975. 109 pp. 

[18] J. Flood. Free topological vector spaces. Dissertationes Math. (Rozprawy 
Mat.), 221:95 pp., 1984. 

[19] E. Giladi, M. G. Walker, J. Z. Wang, and W. Volkmuth. SST: an 
algorithm for finding near-exact sequence matches in time proportional 
to the logarithm of the database size. Bioinformatics, 18(6):873-877, 
2002. 

[20] G. Gonnet, M. Cohen, and S. Benner. Exhaustive matching of the entire 
protein sequence database. Science, 256:1443-1445, 1992. 

[21] O. Gotoh. An improved algorithm for matching biological sequences. J. 
Mol. Biol., 162:705-708, 1982. 

[22] M. I. Graev. Free topological groups. Izvestiya Akad. Nauk SSSR. Ser. 
Mat., 12:279-324, 1948. 

[23] M. I. Graev. Free topological groups. Amer. Math. Soc. Translation, 
1951(35):61, 1951. 






[24] M. Gribskov, A. D. McLachlan, and D. Eisenberg. Profile analysis: 
detection of distantly related proteins. Proc. Natl. Acad. Sci. U.S.A., 
84:4355-4358, 1987. 

[25] D. Gusfield. Algorithms on Strings, Trees, and Sequences - Computer 
Science and Computational Biology. Cambridge University Press, 1997. 

[26] R. W. Hamming. Error detecting and error correcting codes. Bell System 
Tech. J., 29:147-160, 1950. 

[27] J. M. Hellerstein, E. Koutsoupias, and C. H. Papadimitriou. On the 
analysis of indexing schemes. In Proceedings of the Sixteenth ACM 
SIGACT-SIGMOD-SIGART Symposium on Principles of Database Sys- 
tems (PODS'97) (Tucson, Arizona, May), pages 249-256, 1997. 

[28] S. Henikoff and J. Henikoff. Amino acid substitution matrices from 
protein blocks. Proc. Natl. Acad. Sci. U.S.A., 89:10915-10919, 1992. 

[29] S. Henikoff and J. G. Henikoff. Performance evaluation of amino acid 
substitution matrices. Proteins, 17(1):49-61, 1993. 

[30] G. R. Hjaltason and H. Samet. Index-driven similarity search in metric 
spaces. ACM Trans. Database Syst., 28(4):517-580, 2003. 

[31] E. Hunt. Indexed Searching on Proteins Using a Suffix Sequoia. IEEE 
Data Eng. Bull., 27:24-31, 2004. 

[32] E. Hunt, M. P. Atkinson, and R. W. Irving. A database index to large 
biological sequences. VLDB J., 11(3):139-148, 2001. 

[33] M. Itoh, S. Goto, T. Akutsu, and M. Kanehisa. Fast and accurate 
database homology search using upper bounds of local alignment scores. 
Bioinformatics, 21(7):912-921, 2005. 

[34] D. T. Jones, W. R. Taylor, and J. M. Thornton. The rapid generation of 
mutation data matrices from protein sequences. CABIOS, 8(3):275-282, 
1992. 

[35] D. T. Jones, W. R. Taylor, and J. M. Thornton. A mutation data matrix 
for transmembrane proteins. FEBS Lett., 339(3):269-275, Feb 1994. 






[36] M. Kann, B. Qian, and R. A. Goldstein. Optimization of a new score 
function for the detection of remote homologs. Proteins, 41(4):498-503, 
Dec 2000. 

[37] S. Karlin and S. Altschul. Applications and statistics for multiple high- 
scoring segments in molecular sequences. Proc. Natl. Acad. Sci. U.S.A., 
90(12):5873-5877, 1993. 

[38] S. Karlin and S. F. Altschul. Methods for assessing the statistical signif- 
icance of molecular sequence features by using general scoring schemes. 
Proc. Natl. Acad. Sci. U.S.A., 87:2264-2268, 1990. 

[39] K. Katoh, K. Misawa, K.-i. Kuma, and T. Miyata. MAFFT: a novel 
method for rapid multiple sequence alignment based on fast Fourier 
transform. Nucleic Acids Res., 30(14):3059-3066, Jul 2002. 

[40] S. Kawashima, H. Ogata, and M. Kanehisa. AAindex: amino acid index 
database. Nucleic Acids Res., 27:368-369, 1999. 

[41] W. J. Kent. BLAT-the BLAST-like alignment tool. Genome Res., 
12(4):656-664, 2002. 

[42] M. Kschischo, M. Lässig, and Y. Yu. Toward an accurate statistics of 
gapped alignments. Bull. Math. Biol., 67(1):169-91, 2005. 

[43] H.-P. A. Künzi. Nonsymmetric distances and their associated topologies: 
about the origins of basic ideas in the area of asymmetric topology. In 
Handbook of the history of general topology, Vol. 3, volume 3 of Hist. 
Topol., pages 853-968. Kluwer Acad. Publ., Dordrecht, 2001. 

[44] H.-P. A. Künzi and V. Vajner. Weighted quasi-metrics. In Papers on 
general topology and applications (Flushing, NY, 1992), pages 64-77. 
New York Acad. Sci., New York, 1994. 

[45] T. Lassmann and E. L. L. Sonnhammer. Kalign-an accurate and fast 
multiple sequence alignment algorithm. BMC Bioinformatics, 6:298, 
2005. 

[46] V. I. Levenstein. Binary codes capable of correcting insertions and re- 
versals. Sov. Phys. Dokl, pages 707-710, 1966. 






[47] M. Linial, N. Linial, N. Tishby, and G. Yona. Global self organization 
of all known protein sequences reveals inherent biological signatures. J. 
Mol. Biol., 268:539-556, 1997. 

[48] R. Mao, W. Xu, N. Singh, and D. P. Miranker. An assessment of a 
metric space database index to support sequence homology. In 3rd IEEE 
International Symposium on BioInformatics and BioEngineering (BIBE 
2003), (Bethesda, Maryland, March 2003), pages 375-384, 2003. 

[49] M. A. Marti-Renom, M. Madhusudhan, and A. Sali. Alignment of pro- 
tein sequences by their profiles. Protein Sci., 13(4):1071-1087, 2004. 

[50] S. G. Matthews. Partial metric topology. In Papers on general topology 
and applications (Flushing, NY, 1992), volume 728 of Ann. New York 
Acad. Sci., pages 183-197. New York Acad. Sci., New York, 1994. 

[51] T. Müller, S. Rahmann, and M. Rehmsmeier. Non-symmetric score 
matrices and the detection of homologous transmembrane proteins. In 
ISMB (Supplement of Bioinformatics), pages 182-189, 2001. 

[52] T. Müller, R. Spang, and M. Vingron. Estimating Amino Acid Substitu- 
tion Models: A Comparison of Dayhoff's Estimator, the Resolvent Ap- 
proach and a Maximum Likelihood Method. Mol. Biol. Evol., 19(1):8- 
13, 2002. 

[53] T. Müller and M. Vingron. Modeling amino acid replacement. J. Com- 
put. Biol., 7(6):761-776, 2000. 

[54] K. Nakai, A. Kidera, and M. Kanehisa. Cluster analysis of amino acid 
indices for prediction of protein structure and function. Protein Eng., 
2:93-100, 1988. 

[55] S. Needleman and C. Wunsch. A general method applicable to the search 
for similarities in the amino acid sequence of two proteins. J. Mol. Biol., 
48:443-453, 1970. 

[56] P. C. Ng, J. G. Henikoff, and S. Henikoff. PHAT: a transmembrane- 
specific substitution matrix. Bioinformatics, 16(9):760-766, 2000. 

[57] W. R. Pearson and D. J. Lipman. Improved tools for biological sequence 
analysis. Proc. Natl. Acad. Sci. U.S.A., 85:2444-2448, 1988. 






[58] V. Pestov. Douady's conjecture on Banach analytic spaces. C. R. Acad. 
Sci. Paris Ser. I Math., 319(10):1043-1048, 1994. 

[59] V. Pestov. Topological groups: where to from here? Topology Proc.,
24:421-502, 1999. 

[60] V. Pestov. On the geometry of similarity search: dimensionality curse 
and concentration of measure. Inform. Process. Lett., 73:47-51, 2000. 

[61] V. Pestov and A. Stojmirovic. Indexing schemes for similarity search: 
an illustrated paradigm. Fundam. Inform., 70(4):367-385, 2006. 

[62] S. Pietrokovski. Searching databases of conserved sequence regions by 
aligning protein multiple-alignments [published erratum appears in Nu- 
cleic Acids Res 1996 Nov 1;24(21):4372]. Nucl. Acids Res., 24(19):3836- 
3845, 1996. 

[63] A. Prlic, F. S. Domingues, and M. J. Sippl. Structure-derived substitu- 
tion matrices for alignment of distantly related sequences. Protein Eng., 
13(8):545-550, 2000. 

[64] S. Romaguera and M. P. Schellekens. Weightable quasi-metric semigroups
and semilattices. Electr. Notes Theor. Comput. Sci., 40, 2000.

[65] L. Rychlewski, L. Jaroszewski, W. Li, and A. Godzik. Comparison of 
sequence profiles. Strategies for structural predictions using sequence 
information. Protein Sci., 9(2):232-241, 2000.

[66] O. Sasson, N. Linial, and M. Linial. The metric space of 
proteins-comparative study of clustering algorithms. Bioinformatics, 
18(suppl. 1):S14-21, 2002.

[67] P. H. Sellers. On the theory and computation of evolutionary distances. 
SIAM J. Appl. Math., 26:787-793, 1974. 

[68] T. F. Smith and M. S. Waterman. Comparison of biosequences. Adv. 
in Appl. Math., 2(4):482-489, 1981. 

[69] T. F. Smith and M. S. Waterman. Identification of common molecular 
subsequences. J. Mol. Biol., 147:195-197, 1981.



[70] T. F. Smith, M. S. Waterman, and W. M. Fitch. Comparative biosequence
metrics. J. Mol. Evol., 18:38-46, 1981.

[71] P. A. Spiro and N. Macura. A local alignment metric for accelerating 
biosequence database search. J. Comput. Biol., 11(1):61-82, 2004.

[72] D. J. States, W. Gish, and S. F. Altschul. Improved sensitivity of nu- 
cleic acid database similarity searches using application specific scoring 
matrices. Methods: A companion to Methods in Enzymology, 3:66-70, 
1991. 

[73] A. Stojmirovic. Quasi-metric spaces with measure. Topology Proc.,
28(2):655-671, 2004. 

[74] A. Stojmirovic and V. Pestov. Indexing schemes for similarity search in 
datasets of short protein fragments. Inf. Syst., 32(8):1145-1165, 2007.

[75] Z. Tan, X. Cao, B. C. Ooi, and A. K. H. Tung. The ed-tree: an index 
for large DNA sequence databases. In SSDBM'2003: Proceedings of the
15th international conference on Scientific and statistical database man- 
agement, pages 151-160, Washington, DC, USA, 2003. IEEE Computer 
Society. 

[76] J. D. Thompson, D. G. Higgins, and T. J. Gibson. CLUSTAL W:
improving the sensitivity of progressive multiple sequence alignment
through sequence weighting, position-specific gap penalties and weight
matrix choice. Nucleic Acids Res., 22(22):4673-4680, November 1994.

[77] K. Tomii and M. Kanehisa. Analysis of amino acid indices and mutation 
matrices for sequence comparison and structure prediction of proteins. 
Protein Eng., 9:27-36, 1996. 

[78] S. Veerassamy, A. Smith, and E. R. M. Tillier. A transition probability
model for amino acid substitutions from blocks. J. Comput. Biol.,
10(6):997-1010, 2003.

[79] P. Vitolo. The representation of weighted quasi-metric spaces. Rend. 
Istit. Mat. Univ. Trieste, 31(1-2):95-100, 1999.

[80] L. Wang and T. Jiang. On the complexity of multiple sequence alignment.
J. Comput. Biol., 1(4):337-348, 1994.



[81] M. S. Waterman. Efficient sequence alignment algorithms. J. Theor.
Biol., 108(3):333-337, 1984.

[82] M. S. Waterman, T. F. Smith, and W. A. Beyer. Some biological sequence
metrics. Advances in Math., 20(3):367-387, 1976.

[83] W. Wu, H. Xiong, and S. Shekhar, editors. Clustering and Information 
Retrieval. Kluwer, 2003. 

[84] G. Yona and M. Levitt. Within the twilight zone: a sensitive profile- 
profile comparison tool based on information theory. J. Mol. Biol., 
315(5):1257-1275, 2002. 

[85] Y.-K. Yu. A metric measure for weight matrices of variable lengths 
- with applications to clustering and classification of hidden Markov 
models. Physica A, 375:212-220, 2007. 

[86] Y.-K. Yu and S. F. Altschul. The construction of amino acid substitution 
matrices for the comparison of proteins with non-standard compositions. 
Bioinformatics, 21(7):902-911, Apr 2005. 

[87] Y.-K. Yu, J. C. Wootton, and S. F. Altschul. The compositional adjust- 
ment of amino acid substitution matrices. Proc. Natl. Acad. Sci. U.S.A., 
100(26):15688-15693, 2003. 


